Science Museum Group Datasets
9th April 2025
We have made available below a number of bulk exports and datasets for academic or personal research use.
Hopefully these datasets will remove the need to bulk query our API, which we have had to rate limit heavily in response to overly aggressive crawling.
If you use one of our datasets for your project we would love to hear about it. It is always interesting to hear about the projects and uses of our data, and it helps us justify our time and investment in maintaining them.
Hopefully the datasets will be somewhat self-describing; the best way to get a handle on their structure and the data they contain is to compare a small subset of them with their matching online record.
Using our Data
Metadata
Data in the title, made, maker and details fields are released under a Creative Commons Zero licence.
Descriptions and textual content
Descriptions and all other text content are licensed under a Creative Commons Attribution 4.0 licence.
Attribution should be made to ‘© The Board of Trustees of the Science Museum’ along with a link back to the Science Museum Collection website https://collection.sciencemuseumgroup.org.uk
Images
Each image referenced within the dataset has its own copyright:
"legal": {
    "rights": [
        {
            "licence": "CC BY-NC-SA 4.0",
            "copyright": "© The Board of Trustees of the Science Museum"
        }
    ]
},
"credit": { "value": "Science Museum Group Collection" }
The full path of an image can be constructed by appending the path component of a multimedia location: element (for example /241/152/large_thumbnail_1901_0006__0004_.jpg) to the base URL for our image server, https://coimages.sciencemuseumgroup.org.uk. The medium_thumbnail (~240px × 240px) is likely ideal for the purposes of training basic ML models.
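The URL construction described above can be sketched in Python as follows. This is a minimal illustration, not an official client: the helper names are hypothetical, and it assumes each multimedia entry keeps its sizes under an '@processed' key with a 'location' per size (the 'large_thumbnail' key is assumed by analogy with 'medium_thumbnail').

```python
IMAGE_BASE_URL = "https://coimages.sciencemuseumgroup.org.uk"


def build_image_url(location, base_url=IMAGE_BASE_URL):
    """Join the image server base URL and a multimedia location path."""
    # Normalise slashes so exactly one separates the base and the path
    return base_url.rstrip("/") + "/" + location.lstrip("/")


def thumbnail_url(image, size="medium_thumbnail"):
    """Return the full URL for one size of a multimedia entry, or '' if absent."""
    # Assumed layout: image['@processed'][size]['location']
    location = image.get("@processed", {}).get(size, {}).get("location", "")
    return build_image_url(location) if location else ""


# Demo entry using the example path quoted in the text
example_image = {
    "@processed": {
        "large_thumbnail": {
            "location": "/241/152/large_thumbnail_1901_0006__0004_.jpg"
        }
    }
}
print(thumbnail_url(example_image, size="large_thumbnail"))
```

An entry that lacks the requested size returns an empty string, so the caller can skip it rather than request a malformed URL.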
Please do not unnecessarily download the larger image sizes in bulk or you may be rate limited.
Alternatively, you can download all the medium_thumbnail images as a single .zip file, in which case you will simply need to match up the image paths to wherever you placed that image folder locally.
smg_all_medium_thumbnail_images_09_04_2025.zip (1.19GB)
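Matching image paths to a local folder might look like the sketch below. It assumes the unpacked archive preserves the same relative paths as the image locations in the records, which should be checked against the actual download; the folder name smg_images and the example location are hypothetical.

```python
from pathlib import Path


def local_image_path(location, image_root):
    """Map a dataset image location to a file path under image_root."""
    # Drop the leading slash so the location is treated as relative
    return Path(image_root) / location.lstrip("/")


def find_missing(locations, image_root):
    """Return the locations that have no matching file under image_root."""
    return [
        loc for loc in locations
        if not local_image_path(loc, image_root).is_file()
    ]


# Hypothetical folder name and location, for illustration only
print(local_image_path("/241/152/example.jpg", "smg_images"))
```

Running find_missing over the image column of a dataset is a quick way to confirm whether the assumed folder layout matches the archive.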
Images must not be harvested from our IIIF and Zoom endpoints; we will block access and may act against anyone doing so.
Copyright of images
Please check and only use images which fall under one of the following three licences, and attribute accordingly.
Always check the licence
Many records contain more than one image, and while the first image may be made available under an open licence, that does not necessarily hold true for all other images on that record. Always check the licence for each individual image.
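A per-image licence check could be sketched as below. It assumes the "legal" block shown earlier sits on each entry of a record's multimedia list, and that you supply the set of licence strings you are permitted to use; the function names and the demo record are hypothetical.

```python
def image_licences(image):
    """Return every licence string stated for one multimedia entry."""
    rights = image.get("legal", {}).get("rights", [])
    return [r.get("licence", "") for r in rights if isinstance(r, dict)]


def images_with_allowed_licence(record, allowed):
    """Keep only images whose stated licences are all in `allowed`."""
    kept = []
    for image in record.get("multimedia", []):
        licences = image_licences(image)
        # Images with no licence information are excluded, not assumed open
        if licences and all(licence in allowed for licence in licences):
            kept.append(image)
    return kept


# Demo record: the second image carries no licence information
demo_record = {
    "multimedia": [
        {"legal": {"rights": [{"licence": "CC BY-NC-SA 4.0"}]}},
        {"legal": {"rights": []}},
    ]
}
usable = images_with_allowed_licence(demo_record, {"CC BY-NC-SA 4.0"})
print(len(usable))
```

Excluding images without licence information is a deliberately cautious choice in this sketch.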
Attribution
Attribution should be made to ‘© The Board of Trustees of the Science Museum’ along with a link back to the Science Museum Collection website https://collection.sciencemuseumgroup.org.uk where applicable.
Sources of data
Our collection of 7 million items has been added to and documented since 1851. Some historic records within the collection may contain illustrations and descriptions that are offensive or reflect outdated ideas and analysis. This material does not reflect our values as an organisation that is Open for All.
We are actively improving these historic records, including by adding context, correcting errors, changing language and calling out offensive characterisations. Sometimes we have retained original content as evidence of historic racism.
Notice & Takedown Policy
The Science Museum Group has made all reasonable efforts to ensure that images and other content on our collection website are reproduced with the consent of their copyright holders. However, there are some copyright holders who so far have proved to be untraceable. For more information, please view our Notice & Takedown Policy.
SMG Collection website
If you are simply looking to browse our collection online, you can find material published from our object and archival collections on the Science Museum Group collection website. We have over 500,000 objects and 50,000 archival documents published online.
Datasets
The following datasets are made available for academic or personal research use.
JSON
Note: Some of the following files are .zip compressed.
Object records
These are by and large what you think of as physical objects. Trains, planes and automobiles, household appliances, TV sets, typewriters, models, command modules and so much more.
- smg_object_records_with_CC_images_09_04_2025.json.zip (150,355 records) 145MB
- smg_object_records_all_09_04_2025.json.zip (525,595 records) 322.8MB
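The .zip compressed exports can be read without unpacking them to disk first. The sketch below uses only the Python standard library and assumes each archive contains a single .json file holding a list of records; load_zipped_json is a hypothetical helper name, and the demo builds a tiny in-memory archive rather than using a real download.

```python
import io
import json
import zipfile


def load_zipped_json(source):
    """Load the first .json member of a zip archive (path or file-like object)."""
    with zipfile.ZipFile(source) as archive:
        json_names = [n for n in archive.namelist() if n.endswith(".json")]
        if not json_names:
            raise ValueError("no .json file found in archive")
        with archive.open(json_names[0]) as member:
            # Decode the member as UTF-8 text before parsing
            return json.load(io.TextIOWrapper(member, encoding="utf-8"))


# Build a small in-memory archive to demonstrate the call
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as demo_zip:
    demo_zip.writestr("demo.json", json.dumps([{"@admin": {"uid": "co8538830"}}]))
buffer.seek(0)

records = load_zipped_json(buffer)
print(len(records))
```

With a real download, pass the path to the .json.zip file instead of the in-memory buffer.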
Documents records
These are largely 2D paper and print-based objects in our collection, i.e. archives, plans, drawings, diaries, books, maps and trade journals.
- smg_document_records_with_CC_images_09_04_2025.json (7,017 records) 159.2MB
- smg_document_records_all_09_04_2025.json (77,409 records) 560.8MB
People and Company records
These are the people and company records related to the objects and documents we have published online and made available in the above datasets.
- smg_people_and_company_records_09_04_2025.json (23,684 records) 61.4MB
Note #1: There may be duplication and overlap in these records; check @admin.source if you want to de-dupe the dataset yourself.
Note #2: There is currently no means to easily distinguish between people and company records. Hence the somewhat loose naming of this export.
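One possible starting point for de-duplication is to look for names that appear under more than one source. The sketch below is illustrative only: it assumes each record stores its source at record['@admin']['source'] and a display name at record['summary']['title'], and the source values in the demo are invented placeholders.

```python
from collections import defaultdict


def possible_duplicates(records):
    """Return {normalised title: sorted sources} for titles seen under several sources."""
    sources_by_title = defaultdict(set)
    for record in records:
        # Assumed field locations; adjust after inspecting a real record
        title = record.get("summary", {}).get("title", "")
        source = record.get("@admin", {}).get("source", "unknown")
        if title:
            sources_by_title[title.strip().lower()].add(source)
    return {
        title: sorted(sources)
        for title, sources in sources_by_title.items()
        if len(sources) > 1
    }


# Demo data with placeholder source names
demo_records = [
    {"@admin": {"source": "source-a"}, "summary": {"title": "Robert Stephenson"}},
    {"@admin": {"source": "source-b"}, "summary": {"title": "Robert Stephenson"}},
    {"@admin": {"source": "source-a"}, "summary": {"title": "Another name"}},
]
print(possible_duplicates(demo_records))
```

Matching on name alone will produce false positives for different people who share a name, so the output is best treated as a list of candidates to review.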
CSV
These are simplified CSV files containing a subset of key fields.
There is quite a lot of granular nested detail in our records. Depending on your use case, you may want to spend some time carefully parsing out and cleaning specific fields from the JSON files.
Downloads
- smg_object_records_with_CC_images_09_04_2025.csv (150,344 records) 48.9MB
- smg_object_records_all_09_04_2025.csv (525,595 records) 135.8MB
Field names
- uid (a URL can be constructed from this, e.g. https://collection.sciencemuseumgroup.org.uk/objects/co8538830)
- identifier (SMG accession number)
- title
- description
- category (a high level museum category)
- material
- object_name (a basic taxonomy classification, some overlap with Getty AAT)
- date (made)
- place (made)
- maker
- image (a medium size image; ties up with image data set below)
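Reading the CSV and building a record URL from the uid field could be sketched as follows, using only the standard library csv module. It assumes object URLs follow the /objects/ pattern shown in the field list; the helper name and the demo row values (other than the uid) are hypothetical.

```python
import csv
import io

OBJECT_URL_PREFIX = "https://collection.sciencemuseumgroup.org.uk/objects/"


def read_rows_with_urls(csv_file):
    """Yield each CSV row as a dict with an extra 'url' key built from 'uid'."""
    for row in csv.DictReader(csv_file):
        row["url"] = OBJECT_URL_PREFIX + row["uid"]
        yield row


# In-memory demo CSV with a subset of the documented columns
demo_csv = io.StringIO(
    "uid,identifier,title\n"
    "co8538830,demo-identifier,Demo title\n"
)
rows = list(read_rows_with_urls(demo_csv))
print(rows[0]["url"])
```

For a real file, open it with open(path, newline="", encoding="utf-8") and pass the file object in place of the in-memory demo.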
Images
A single .zip file containing all the medium sized thumbnails referenced in the files above.
Useful code snippets
Load a JSON file as a pandas DataFrame
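import pandas as pd
import json

# Unzip the download first if it was supplied as a .json.zip file
try:
    with open('smg_object_records_with_CC_images_09_04_2025.json', 'r', encoding='utf-8') as f:
        data = json.load(f)
    print(f"JSON contains {len(data)} top-level records")
    # Flatten nested dictionaries into dotted column names
    df = pd.json_normalize(data)
    print("\nData loaded successfully. First few rows:")
    print(df.head())
    # Summarise columns and dtypes (only reached if loading succeeded)
    df.info()
except Exception as e:
    print(f"Error loading file: {e}")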
Examine the nested fields in the first record
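# Columns whose first value is still a list or dict after json_normalize
nested_fields = [col for col in df.columns if isinstance(df[col].iloc[0], (list, dict))]
print("\nNested fields in first record:")
for col in nested_fields:
    print(f"\nField: {col}")
    print(df[col].iloc[0])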
Compare this structure to a website record:
- Rocket Locomotive - object (HTML / JSON)
- Robert Stephenson - person (HTML / JSON)
- Side elevation of the Rocket locomotive engine - document (HTML / JSON)
Create a simple CSV file from the larger nested JSON exports:
Note: This can take a long time to process
There is quite a lot of granular nested detail in the records. Depending on your use-case you may want to spend some time carefully parsing out and cleaning specific fields from the JSON. The following is provided as an example only.
import pandas as pd
import json


def extract_primary_value(field):
    """Extract primary value from a list of dictionaries"""
    if isinstance(field, list) and len(field) > 0:
        # Prefer the entry flagged as primary
        for item in field:
            if isinstance(item, dict) and item.get('primary', False):
                return item.get('value', '')
        # Otherwise fall back to the first entry
        return field[0].get('value', '') if isinstance(field[0], dict) else ''
    return field


def extract_materials(field):
    """Extract semicolon-separated materials/names"""
    if isinstance(field, list):
        return "; ".join([item.get('value', '') for item in field if isinstance(item, dict)])
    return field


def extract_image_url(field):
    """Extract medium thumbnail image path"""
    if isinstance(field, list) and len(field) > 0:
        first_item = field[0]
        if isinstance(first_item, dict) and '@processed' in first_item:
            return first_item['@processed'].get('medium_thumbnail', {}).get('location', '')
    return ''


def extract_first_summary_title(field):
    """Extract summary.title from the first entry of a list (used for place and maker)"""
    if isinstance(field, list) and len(field) > 0 and isinstance(field[0], dict) and 'summary' in field[0]:
        return field[0]['summary']['title']
    return ''


def process_dataframe(df):
    """Process the DataFrame to extract required fields"""
    processed_df = pd.DataFrame()
    # Extract and transform each field
    processed_df['uid'] = df['@admin.uid']
    processed_df['identifier'] = df['identifier'].apply(extract_primary_value)
    processed_df['title'] = df['title'].apply(extract_primary_value)
    processed_df['description'] = df['description'].apply(extract_primary_value)
    processed_df['category'] = df['category'].apply(
        lambda x: x[0]['name'] if isinstance(x, list) and len(x) > 0 and isinstance(x[0], dict) else x
    )
    processed_df['material'] = df['material'].apply(extract_materials)
    processed_df['object_name'] = df['name'].apply(extract_materials)
    processed_df['date'] = df['creation.date'].apply(extract_primary_value)
    # Extract place and maker (first summary.title)
    processed_df['place'] = df['creation.place'].apply(extract_first_summary_title)
    processed_df['maker'] = df['creation.maker'].apply(extract_first_summary_title)
    # Extract image URL
    processed_df['image'] = df['multimedia'].apply(extract_image_url)
    return processed_df


def main():
    input_file = 'smg_object_records_with_CC_images_09_04_2025.json'  # Change to your input file path
    output_file = 'smg_object_records_with_CC_images_09_04_2025.csv'  # Change to your desired output file path
    columns = [
        'uid', 'identifier', 'title', 'description', 'category',
        'material', 'object_name', 'date', 'place', 'maker', 'image'
    ]
    try:
        # Read JSON file
        with open(input_file, 'r', encoding='utf-8') as f:
            data = json.load(f)
        # Create DataFrame (assuming JSON is a list of records)
        df = pd.json_normalize(data)
        # Process the DataFrame
        processed_df = process_dataframe(df)
        # Export to CSV
        processed_df.to_csv(output_file, index=False, columns=columns)
        print(f"Successfully processed and saved to {output_file}")
        # Print the first row as an example, if there is one
        if not processed_df.empty:
            print("\nExample row:")
            for column in columns:
                print(f"{column}:", processed_df.iloc[0][column])
    except Exception as e:
        print(f"Error processing file: {str(e)}")


if __name__ == "__main__":
    main()