Glue
AWS Glue ETL (Extract, Transform, Load) service and Data Catalog for managing databases, tables, crawlers, and jobs.
Configuration
| Property | Value |
|---|---|
| Protocol | AwsJson1_1 |
| Signing Name | glue |
| Target Prefix | AWSGlue |
| Persistence | No |
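The table above implies a simple request envelope: every call is a POST whose `X-Amz-Target` header joins the target prefix and the operation name. A minimal sketch of that mapping follows; `buildGlueRequest` and `GlueRequest` are hypothetical names introduced for illustration, and the `Authorization` header used in the curl examples is left out here.

```typescript
// Values taken from the Configuration table.
const TARGET_PREFIX = 'AWSGlue';
const CONTENT_TYPE = 'application/x-amz-json-1.1';

// Hypothetical shape for illustration; not part of any SDK.
interface GlueRequest {
  method: 'POST';
  headers: Record<string, string>;
  body: string;
}

// Build the headers and JSON body for one Glue operation.
function buildGlueRequest(operation: string, payload: object = {}): GlueRequest {
  return {
    method: 'POST',
    headers: {
      'Content-Type': CONTENT_TYPE,
      'X-Amz-Target': `${TARGET_PREFIX}.${operation}`,
    },
    body: JSON.stringify(payload),
  };
}

const getDatabaseRequest = buildGlueRequest('GetDatabase', { Name: 'analytics' });
console.log(getDatabaseRequest.headers['X-Amz-Target']); // AWSGlue.GetDatabase
```

The returned object can be handed to any HTTP client pointed at the emulator endpoint.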
Quick Start
Create a Glue database and add a table to it:
# Create a database in the Glue Data Catalog
curl -s http://localhost:4566 \
-H "Content-Type: application/x-amz-json-1.1" \
-H "X-Amz-Target: AWSGlue.CreateDatabase" \
-H "Authorization: AWS4-HMAC-SHA256 Credential=test/20260421/us-east-1/glue/aws4_request, SignedHeaders=host, Signature=fake" \
-d '{"DatabaseInput":{"Name":"analytics","Description":"Analytics data catalog","LocationUri":"s3://my-data-bucket/"}}'
# Create a table in that database
curl -s http://localhost:4566 \
-H "Content-Type: application/x-amz-json-1.1" \
-H "X-Amz-Target: AWSGlue.CreateTable" \
-H "Authorization: AWS4-HMAC-SHA256 Credential=test/20260421/us-east-1/glue/aws4_request, SignedHeaders=host, Signature=fake" \
-d '{"DatabaseName":"analytics","TableInput":{"Name":"events","Description":"User events","StorageDescriptor":{"Columns":[{"Name":"user_id","Type":"string"},{"Name":"event_type","Type":"string"},{"Name":"timestamp","Type":"bigint"}],"Location":"s3://my-data-bucket/events/","InputFormat":"org.apache.hadoop.mapred.TextInputFormat","OutputFormat":"org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat","SerdeInfo":{"SerializationLibrary":"org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe"}}}}'
Operations
Databases
- `CreateDatabase`: create a database in the Glue Data Catalog
  - Input: `DatabaseInput` object with `Name` (required), `Description`, `LocationUri`, `Parameters`
  - Returns: empty response (HTTP 200)
- `GetDatabase`: get a specific database by name
  - Input: `Name`
  - Returns: `Database` with `Name`, `Description`, `LocationUri`, `CreateTime`
- `GetDatabases`: list all databases in the catalog
  - Input: optional `NextToken`, `MaxResults`
  - Returns: paginated `DatabaseList`
- `DeleteDatabase`: delete a database and optionally its tables
  - Input: `Name`
- `UpdateDatabase`: update database properties
  - Input: `Name`, `DatabaseInput`
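Because `GetDatabases` is paginated, a caller that wants the full list has to keep passing the returned `NextToken` back until none comes back. The loop below is a minimal sketch of that pattern; `listAllDatabaseNames`, `DatabasePage`, and the `fetchPage` callback are hypothetical names, and the callback stands in for whatever client actually performs one `GetDatabases` call.

```typescript
// Only the response fields this sketch reads.
interface DatabasePage {
  DatabaseList: { Name: string }[];
  NextToken?: string;
}

// Hypothetical callback: performs one GetDatabases call with the given token.
type FetchDatabasePage = (nextToken?: string) => Promise<DatabasePage>;

// Follow NextToken until the service stops returning one.
async function listAllDatabaseNames(fetchPage: FetchDatabasePage): Promise<string[]> {
  const names: string[] = [];
  let token: string | undefined = undefined;
  do {
    const page: DatabasePage = await fetchPage(token);
    for (const db of page.DatabaseList) {
      names.push(db.Name);
    }
    token = page.NextToken;
  } while (token);
  return names;
}
```

The same loop shape should apply to `GetTables`, which takes the same optional `NextToken` and `MaxResults` inputs.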
Tables
- `CreateTable`: create a table in a Glue database
  - Input: `DatabaseName`, `TableInput` (with `Name`, `StorageDescriptor` containing `Columns`, `Location`, `InputFormat`, `SerdeInfo`)
  - Returns: empty response
- `GetTable`: get a specific table by database and name
  - Input: `DatabaseName`, `Name`
  - Returns: `Table` with full schema including `StorageDescriptor`
- `GetTables`: list tables in a database
  - Input: `DatabaseName`, optional `NextToken`, `MaxResults`
  - Returns: paginated `TableList`
- `DeleteTable`: delete a table
  - Input: `DatabaseName`, `Name`
- `UpdateTable`: update table schema or properties
  - Input: `DatabaseName`, `TableInput`
Crawlers
- `CreateCrawler`: create a crawler to discover and catalog data sources
  - Input: `Name` (required), `Role` (IAM role ARN), `DatabaseName`, `Targets` (`{S3Targets: [{Path: "s3://..."}]}`)
  - Returns: empty response; crawler starts in `READY` state
- `GetCrawler`: get crawler details and current state
  - Input: `Name`
  - Returns: `Crawler` with `Name`, `State` (`READY`, `RUNNING`, `STOPPING`), `LastCrawl`
- `GetCrawlers`: list all crawlers
- `StartCrawler`: start a crawler run
  - Input: `Name`
  - Transitions: `READY` → `RUNNING` → `READY`
- `StopCrawler`: stop a running crawler
  - Input: `Name`
- `DeleteCrawler`: delete a crawler
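Since `StartCrawler` moves a crawler through `RUNNING` and back to `READY`, a test that starts a crawler may want to wait for the `READY` state before asserting anything. The following is a rough polling sketch; `waitForCrawlerReady`, `WaitOptions`, and the `getState` callback are hypothetical names, the attempt count and delay are arbitrary defaults, and `getState` is assumed to wrap a `GetCrawler` call and return its `State` field.

```typescript
type CrawlerState = 'READY' | 'RUNNING' | 'STOPPING';

// Hypothetical options; defaults are arbitrary.
interface WaitOptions {
  maxAttempts?: number;
  delayMs?: number;
  sleep?: (ms: number) => Promise<void>;
}

const defaultSleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Poll getState until it reports READY; resolves with the number of attempts used.
async function waitForCrawlerReady(
  getState: () => Promise<CrawlerState>,
  { maxAttempts = 20, delayMs = 250, sleep = defaultSleep }: WaitOptions = {},
): Promise<number> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    if ((await getState()) === 'READY') {
      return attempt;
    }
    await sleep(delayMs);
  }
  throw new Error(`crawler not READY after ${maxAttempts} attempts`);
}
```

Passing `sleep` in as an option keeps the helper easy to exercise in unit tests without real delays.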
Tables (extended)
- `SearchTables`: search tables by substring match on name or database name
  - Input: `SearchText` (substring), optional `Filters` array
  - Returns: `TableList`
Partitions
- `GetPartitions`: list partitions for a table
  - Input: `DatabaseName`, `TableName`
  - Returns: `Partitions` list with `Values`, `StorageDescriptor`, `CreationTime`
- `CreatePartition`: create a partition
  - Input: `DatabaseName`, `TableName`, `PartitionInput` with `Values` array and optional `StorageDescriptor`
- `DeletePartition`: delete a partition by values
  - Input: `DatabaseName`, `TableName`, `PartitionValues` array
- `BatchCreatePartition`: create multiple partitions in one call
  - Input: `DatabaseName`, `TableName`, `PartitionInputList`
  - Returns: `Errors` list for any failed partitions
- `BatchDeletePartition`: delete multiple partitions in one call
  - Input: `DatabaseName`, `TableName`, `PartitionsToDelete`
  - Returns: `Errors` list for any not-found partitions
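To make the `BatchCreatePartition` input concrete, the sketch below builds a `PartitionInputList` for a single-key partition scheme and splits it into batches. The helper names `buildPartitionInputs` and `chunk`, the `dt` partition key, the Hive-style `key=value/` location layout, and the batch size of 2 are all assumptions made for illustration; check the batch limit of the API you actually target.

```typescript
// Only the PartitionInput fields listed above.
interface PartitionInput {
  Values: string[];
  StorageDescriptor?: { Location: string };
}

// Build one PartitionInput per value, using an assumed key=value/ layout.
function buildPartitionInputs(tableLocation: string, key: string, values: string[]): PartitionInput[] {
  const base = tableLocation.replace(/\/+$/, '') + '/';
  return values.map((value) => ({
    Values: [value],
    StorageDescriptor: { Location: `${base}${key}=${value}/` },
  }));
}

// Split a list into consecutive batches of at most `size` items.
function chunk<T>(items: T[], size: number): T[][] {
  if (size < 1) throw new Error('size must be at least 1');
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    out.push(items.slice(i, i + size));
  }
  return out;
}

const partitionInputs = buildPartitionInputs(
  's3://my-data-bucket/events',
  'dt',
  ['2026-04-19', '2026-04-20', '2026-04-21'],
);

// One request body per batch.
const batchRequests = chunk(partitionInputs, 2).map((PartitionInputList) => ({
  DatabaseName: 'analytics',
  TableName: 'events',
  PartitionInputList,
}));
console.log(batchRequests.length); // 2
```

Each request's response should then be checked for a non-empty `Errors` list, since failed partitions are reported there.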
Crawlers (extended)
- `UpdateCrawler`: update crawler configuration (role, targets, schedule, description)
  - Input: `Name`, plus any of: `Role`, `DatabaseName`, `Targets`, `Schedule`, `Description`
- `GetCrawlerMetrics`: returns empty metrics list (stub)
- `GetClassifier` / `GetClassifiers`: returns not-found / empty list (no classifier storage)
Jobs
- `CreateJob`: create an ETL job definition
  - Input: `Name` (required), `Role` (IAM role ARN), `Command` (`{Name: "glueetl", ScriptLocation: "s3://..."}`)
  - Returns: `Name`
- `GetJob`: get job details by name
  - Input: `JobName`
  - Returns: `Job` with `Name`, `Role`, `Command`, `MaxCapacity`
- `GetJobs`: list all ETL jobs
- `DeleteJob`: delete a job definition
- `BatchGetJobs`: get multiple jobs by name list
  - Input: `JobNames` array
  - Returns: `Jobs` (found) and `JobsNotFound` arrays
Job Runs
- `StartJobRun`: start a job run; immediately marked `SUCCEEDED` in the emulator
  - Input: `JobName`, optional `Arguments`
  - Returns: `JobRunId`
- `GetJobRun`: get status of a specific run
  - Input: `JobName`, `RunId`
  - Returns: `JobRun` with `Id`, `JobRunState`, `StartedOn`, `CompletedOn`
- `GetJobRuns`: list all runs for a job
  - Input: `JobName`
  - Returns: `JobRuns` list
- `BatchStopJobRun`: stop multiple job runs
  - Input: `JobName`, `JobRunIds` array
  - Returns: `SuccessfulSubmissions` and `Errors`
Connections
- `CreateConnection`: create a Glue connection (JDBC, S3, etc.)
  - Input: `ConnectionInput` with `Name`, `ConnectionType`, `ConnectionProperties`
- `GetConnections`: list all connections
  - Returns: `ConnectionList`
- `DeleteConnection`: delete a connection by name
  - Input: `ConnectionName`
Tags
- `GetTags`: get tags for a resource ARN
  - Input: `ResourceArn`
  - Returns: `Tags` map
- `TagResource`: add tags to a resource
  - Input: `ResourceArn`, `TagsToAdd` map
- `UntagResource`: remove tags from a resource
  - Input: `ResourceArn`, `TagsToRemove` array of keys
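The tag operations identify resources by ARN, so callers usually need to assemble one first. The sketch below follows the AWS Glue ARN layout (`arn:aws:glue:<region>:<account>:<type>/<name>`); `glueArn` and `GlueResource` are hypothetical names, the default region and account ID are borrowed from the examples in this page, and whether the emulator validates the ARN format is not stated here.

```typescript
// Resource kinds covered by this sketch; AWS defines more.
type GlueResource =
  | { kind: 'database'; name: string }
  | { kind: 'table'; database: string; name: string }
  | { kind: 'crawler'; name: string }
  | { kind: 'job'; name: string };

// Assemble a Glue ARN; table ARNs also carry the database name.
function glueArn(
  resource: GlueResource,
  region: string = 'us-east-1',
  accountId: string = '000000000000',
): string {
  const path =
    resource.kind === 'table'
      ? `table/${resource.database}/${resource.name}`
      : `${resource.kind}/${resource.name}`;
  return `arn:aws:glue:${region}:${accountId}:${path}`;
}

// Request body for TagResource.
const tagRequest = {
  ResourceArn: glueArn({ kind: 'job', name: 'events-etl' }),
  TagsToAdd: { team: 'analytics' },
};
console.log(tagRequest.ResourceArn); // arn:aws:glue:us-east-1:000000000000:job/events-etl
```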
Curl Examples
# 1. List all databases
curl -s http://localhost:4566 \
-H "Content-Type: application/x-amz-json-1.1" \
-H "X-Amz-Target: AWSGlue.GetDatabases" \
-H "Authorization: AWS4-HMAC-SHA256 Credential=test/20260421/us-east-1/glue/aws4_request, SignedHeaders=host, Signature=fake" \
-d '{}'
# 2. Create an ETL job
curl -s http://localhost:4566 \
-H "Content-Type: application/x-amz-json-1.1" \
-H "X-Amz-Target: AWSGlue.CreateJob" \
-H "Authorization: AWS4-HMAC-SHA256 Credential=test/20260421/us-east-1/glue/aws4_request, SignedHeaders=host, Signature=fake" \
-d '{"Name":"events-etl","Role":"arn:aws:iam::000000000000:role/GlueRole","Command":{"Name":"glueetl","ScriptLocation":"s3://my-scripts/transform.py","PythonVersion":"3"},"MaxCapacity":2.0}'
# 3. Start a crawler
curl -s http://localhost:4566 \
-H "Content-Type: application/x-amz-json-1.1" \
-H "X-Amz-Target: AWSGlue.StartCrawler" \
-H "Authorization: AWS4-HMAC-SHA256 Credential=test/20260421/us-east-1/glue/aws4_request, SignedHeaders=host, Signature=fake" \
-d '{"Name":"my-crawler"}'
SDK Example
import {
  GlueClient,
  CreateDatabaseCommand,
  CreateTableCommand,
  CreateCrawlerCommand,
  GetTablesCommand,
} from '@aws-sdk/client-glue';
const glue = new GlueClient({
  region: 'us-east-1',
  endpoint: 'http://localhost:4566',
  credentials: { accessKeyId: 'test', secretAccessKey: 'test' },
});
// Create database
await glue.send(new CreateDatabaseCommand({
  DatabaseInput: {
    Name: 'analytics',
    Description: 'Analytics data catalog',
  },
}));
// Create table with schema
await glue.send(new CreateTableCommand({
  DatabaseName: 'analytics',
  TableInput: {
    Name: 'events',
    StorageDescriptor: {
      Columns: [
        { Name: 'user_id', Type: 'string' },
        { Name: 'event_type', Type: 'string' },
        { Name: 'created_at', Type: 'timestamp' },
        { Name: 'metadata', Type: 'map<string,string>' },
      ],
      Location: 's3://my-data-bucket/events/',
      InputFormat: 'org.apache.hadoop.mapred.TextInputFormat',
      OutputFormat: 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat',
      SerdeInfo: {
        SerializationLibrary: 'org.openx.data.jsonserde.JsonSerDe',
        Parameters: { 'serialization.format': '1' },
      },
    },
  },
}));
// List tables
const { TableList } = await glue.send(new GetTablesCommand({
  DatabaseName: 'analytics',
}));
console.log('Tables:', TableList?.map(t => t.Name));
// Create crawler
await glue.send(new CreateCrawlerCommand({
  Name: 'data-crawler',
  Role: 'arn:aws:iam::000000000000:role/GlueRole',
  DatabaseName: 'analytics',
  Targets: {
    S3Targets: [{ Path: 's3://my-data-bucket/' }],
  },
}));
Behavior Notes
- Glue in AWSim manages catalog metadata (databases, tables, crawlers, jobs, connections) but does not execute ETL code or run actual crawl jobs.
- `StartCrawler` transitions the crawler state `READY` → `RUNNING` → `READY` quickly (simulated) but does not discover or catalog any data from S3 or other sources.
- `StartJobRun` immediately creates a run with status `SUCCEEDED`; no actual code executes.
- Partitions are stored on the parent table; all partition operations are in-memory.
- The Glue Data Catalog is shared across services — Athena references the same catalog when listing databases.
- State is in-memory only and lost on restart.