GitHub Runners¶

Overview¶

This specification defines the GitHub Actions self-hosted runner fleet implementation for azlin, enabling VM pools to function as auto-scaling CI/CD runners.

Architecture¶

Component Diagram¶

CLI Command (azlin github-runner enable)
    ↓
RunnerLifecycleManager (Orchestrator)
    ↓
    ├─→ RunnerProvisioner (GitHub API Integration)
    ├─→ JobQueueMonitor (Queue Metrics)
    ├─→ AutoScaler (Scaling Decisions)
    └─→ VMProvisioner (Existing - VM Lifecycle)

Module 1: RunnerProvisioner¶

File: src/azlin/modules/github_runner_provisioner.py

Purpose: Handle GitHub Actions runner registration/deregistration via REST API

Data Models¶

from dataclasses import dataclass
from datetime import datetime
from typing import Literal

@dataclass
class RunnerConfig:
    """Configuration for a GitHub Actions runner."""
    repo_owner: str
    repo_name: str
    runner_name: str
    labels: list[str]
    runner_group: str | None = None

@dataclass
class RunnerRegistration:
    """Details of a registered runner."""
    runner_id: int
    runner_name: str
    registration_token: str
    token_expires_at: datetime

@dataclass
class RunnerInfo:
    """Runtime information about a runner."""
    runner_id: int
    runner_name: str
    status: Literal["online", "offline"]
    busy: bool
    labels: list[str]

Public Interface¶

class GitHubRunnerProvisioner:
    """Manage GitHub Actions runner registration via API."""

    @classmethod
    def get_registration_token(
        cls,
        repo_owner: str,
        repo_name: str,
        github_token: str
    ) -> str:
        """Get registration token from GitHub API.

        API: POST /repos/{owner}/{repo}/actions/runners/registration-token
        Token expires after 1 hour.
        """

    @classmethod
    def register_runner(
        cls,
        ssh_config: SSHConfig,
        config: RunnerConfig,
        registration_token: str
    ) -> int:
        """Register runner on VM and return runner ID.

        Steps:
        1. Download runner binary if needed
        2. Run ./config.sh with token and labels
        3. Extract runner ID from response
        4. Start runner service
        """

    @classmethod
    def deregister_runner(
        cls,
        repo_owner: str,
        repo_name: str,
        runner_id: int,
        github_token: str
    ) -> None:
        """Remove runner from GitHub.

        API: DELETE /repos/{owner}/{repo}/actions/runners/{runner_id}
        """

    @classmethod
    def get_runner_info(
        cls,
        repo_owner: str,
        repo_name: str,
        runner_id: int,
        github_token: str
    ) -> RunnerInfo:
        """Get current runner status.

        API: GET /repos/{owner}/{repo}/actions/runners/{runner_id}
        """

API Endpoints Used¶

POST /repos/{owner}/{repo}/actions/runners/registration-token
DELETE /repos/{owner}/{repo}/actions/runners/{runner_id}
GET /repos/{owner}/{repo}/actions/runners/{runner_id}
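
As a rough illustration of the first endpoint, the sketch below builds and sends the registration-token request. It is not the azlin implementation: the helper names are hypothetical, and the standard library's `urllib` is used only to keep the example dependency-free (the spec lists `requests` as the HTTP client). The 30-second timeout comes from the security requirements; the `token` field is read from GitHub's JSON response.

```python
import json
import urllib.request

GITHUB_API = "https://api.github.com"
API_TIMEOUT_SECONDS = 30  # timeout from the security requirements


def build_registration_token_request(
    repo_owner: str, repo_name: str, github_token: str
) -> urllib.request.Request:
    """Build the POST request without sending it (hypothetical helper)."""
    url = (
        f"{GITHUB_API}/repos/{repo_owner}/{repo_name}"
        "/actions/runners/registration-token"
    )
    return urllib.request.Request(
        url,
        method="POST",
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {github_token}",
        },
    )


def get_registration_token(
    repo_owner: str, repo_name: str, github_token: str
) -> str:
    """Send the request and return the short-lived registration token."""
    request = build_registration_token_request(
        repo_owner, repo_name, github_token
    )
    with urllib.request.urlopen(request, timeout=API_TIMEOUT_SECONDS) as response:
        payload = json.load(response)
    return payload["token"]
```

Separating request construction from sending keeps the URL and header logic testable without network access.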

Security Requirements¶

HTTPS only for API calls
Token validation before operations
No token storage (environment variable only)
Input sanitization for repo names and labels
Timeout on API calls (30 seconds)
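
Input sanitization matters most where repository names and labels reach a shell on the VM. The sketch below shows one way the `./config.sh` invocation from `register_runner` might be assembled. The flags are options of the GitHub runner's `config.sh`; the helper name `build_config_command` and the decision to always pass `--unattended` and `--ephemeral` are assumptions made for illustration.

```python
from __future__ import annotations

import shlex


def build_config_command(
    repo_owner: str,
    repo_name: str,
    runner_name: str,
    labels: list[str],
    registration_token: str,
    runner_group: str | None = None,
) -> str:
    """Build the ./config.sh command line to run on the VM (hypothetical helper)."""
    args = [
        "./config.sh",
        "--url", f"https://github.com/{repo_owner}/{repo_name}",
        "--token", registration_token,
        "--name", runner_name,
        "--labels", ",".join(labels),
        "--unattended",  # no interactive prompts over SSH
        "--ephemeral",   # runner is removed after it completes one job
    ]
    if runner_group is not None:
        args += ["--runnergroup", runner_group]
    # shlex.join quotes every argument, so a value cannot inject shell syntax.
    return shlex.join(args)
```

Quoting at the last step is a second line of defence; the regex checks under Input Validation should still reject bad values before this point.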

Error Handling¶

class RunnerProvisioningError(Exception):
    """Base error for runner provisioning."""

class RegistrationTokenError(RunnerProvisioningError):
    """Failed to get registration token."""

class RunnerRegistrationError(RunnerProvisioningError):
    """Failed to register runner."""

class RunnerDeregistrationError(RunnerProvisioningError):
    """Failed to deregister runner."""

Module 2: JobQueueMonitor¶

File: src/azlin/modules/github_queue_monitor.py

Purpose: Monitor GitHub Actions job queue depth for scaling decisions

Data Models¶

from dataclasses import dataclass
from datetime import datetime

@dataclass
class QueueMetrics:
    """Metrics about the GitHub Actions job queue."""
    pending_jobs: int
    in_progress_jobs: int
    queued_jobs: int
    total_jobs: int
    timestamp: datetime

    @property
    def needs_scaling(self) -> bool:
        """Quick check if scaling might be needed."""
        return self.pending_jobs > 0 or self.queued_jobs > 0

Public Interface¶

class GitHubJobQueueMonitor:
    """Monitor GitHub Actions job queue."""

    @classmethod
    def get_queue_metrics(
        cls,
        repo_owner: str,
        repo_name: str,
        labels: list[str] | None,
        github_token: str
    ) -> QueueMetrics:
        """Get current queue metrics for repository.

        API: GET /repos/{owner}/{repo}/actions/runs?status=queued
        API: GET /repos/{owner}/{repo}/actions/runs?status=in_progress

        If labels provided, filter jobs by runner labels.
        """

    @classmethod
    def get_pending_job_count(
        cls,
        repo_owner: str,
        repo_name: str,
        labels: list[str] | None,
        github_token: str
    ) -> int:
        """Get count of pending jobs (convenience method)."""

API Endpoints Used¶

GET /repos/{owner}/{repo}/actions/runs?status=queued
GET /repos/{owner}/{repo}/actions/runs?status=in_progress
GET /repos/{owner}/{repo}/actions/runs?per_page=100

Filtering Logic¶

Jobs are filtered by comparing each job's labels (from runs-on) against the runner labels:

If no labels are specified: count all jobs
If labels are specified: count only jobs where ALL labels match
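
The filtering rule could be sketched as follows. The spec does not say in which direction "ALL labels match" applies; this example assumes GitHub's routing rule, where a job is counted when every label it requests is offered by the fleet. The function names are hypothetical, and each job is assumed to be a dict with a `labels` list, as in the job objects returned by GitHub's workflow jobs endpoint.

```python
from __future__ import annotations


def job_matches_fleet(job_labels: list[str], fleet_labels: list[str] | None) -> bool:
    """Return True if the job should be counted for this fleet."""
    if not fleet_labels:
        return True  # no filter configured: count every job
    # Assumption: every label the job requests must be offered by the fleet.
    return set(job_labels) <= set(fleet_labels)


def count_matching_jobs(jobs: list[dict], fleet_labels: list[str] | None) -> int:
    """Count jobs whose labels pass the fleet filter."""
    return sum(
        1 for job in jobs if job_matches_fleet(job.get("labels", []), fleet_labels)
    )
```

If the intended rule is the reverse (every configured label must appear on the job), only the subset comparison needs to be flipped.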

Module 3: AutoScaler¶

File: src/azlin/modules/github_runner_autoscaler.py

Purpose: Make intelligent scaling decisions based on queue metrics

Data Models¶

from dataclasses import dataclass
from typing import Literal

@dataclass
class ScalingConfig:
    """Configuration for autoscaling behavior."""
    min_runners: int = 0
    max_runners: int = 10
    jobs_per_runner: int = 2  # Target ratio of pending jobs to runners
    scale_up_threshold: int = 2  # Scale up when target exceeds current runners by more than this
    scale_down_threshold: int = 0  # Scale down when target is below current runners by more than this
    cooldown_seconds: int = 300  # Minimum wait between scaling actions

@dataclass
class ScalingDecision:
    """Decision about scaling action."""
    action: Literal["scale_up", "scale_down", "maintain"]
    target_runner_count: int
    current_runner_count: int
    reason: str

Public Interface¶

class GitHubRunnerAutoScaler:
    """Make scaling decisions for runner fleet."""

    @classmethod
    def calculate_scaling_decision(
        cls,
        queue_metrics: QueueMetrics,
        current_runner_count: int,
        config: ScalingConfig,
        last_scaling_action: datetime | None = None
    ) -> ScalingDecision:
        """Calculate scaling decision based on metrics.

        Logic:
        1. Check cooldown period
        2. Calculate target runners = ceil(pending_jobs / jobs_per_runner)
        3. Apply min/max constraints
        4. Determine action (up/down/maintain)
        """

Scaling Algorithm¶

target_runners = ceil(pending_jobs / jobs_per_runner)
target_runners = max(min_runners, min(max_runners, target_runners))

if target_runners > current_runners + scale_up_threshold:
    action = "scale_up"
elif target_runners < current_runners - scale_down_threshold:
    action = "scale_down"
else:
    action = "maintain"
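
A runnable sketch of the same algorithm, with the cooldown check from `calculate_scaling_decision` added in front. The function name `decide`, the keyword arguments, and the `(action, target)` return shape are simplifications for illustration; the spec's version takes `QueueMetrics` and `ScalingConfig` and returns a `ScalingDecision`. Timestamps are assumed to be timezone-aware.

```python
from __future__ import annotations

import math
from datetime import datetime, timedelta, timezone


def decide(
    pending_jobs: int,
    current_runners: int,
    *,
    min_runners: int = 0,
    max_runners: int = 10,
    jobs_per_runner: int = 2,
    scale_up_threshold: int = 2,
    scale_down_threshold: int = 0,
    cooldown_seconds: int = 300,
    last_scaling_action: datetime | None = None,
    now: datetime | None = None,
) -> tuple[str, int]:
    """Return (action, target_runner_count) for the current queue state."""
    now = now or datetime.now(timezone.utc)

    # 1. Inside the cooldown window nothing changes.
    if last_scaling_action is not None:
        if now - last_scaling_action < timedelta(seconds=cooldown_seconds):
            return "maintain", current_runners

    # 2-3. Target from queue depth, clamped to the configured bounds.
    target = math.ceil(pending_jobs / jobs_per_runner)
    target = max(min_runners, min(max_runners, target))

    # 4. Thresholds act as a dead band around the current runner count.
    if target > current_runners + scale_up_threshold:
        return "scale_up", target
    if target < current_runners - scale_down_threshold:
        return "scale_down", target
    return "maintain", current_runners
```

With the defaults, `decide(10, 0)` gives `("scale_up", 5)` while `decide(2, 0)` gives `("maintain", 0)`: an empty fleet is left untouched until five jobs are pending, because the target must exceed the current count by more than `scale_up_threshold`. That interaction may be worth revisiting when `min_runners` is 0.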

Module 4: RunnerLifecycleManager¶

File: src/azlin/modules/github_runner_lifecycle.py

Purpose: Orchestrate complete ephemeral runner lifecycle

Data Models¶

from dataclasses import dataclass
from datetime import datetime
from typing import Literal

@dataclass
class RunnerLifecycleConfig:
    """Configuration for runner lifecycle management."""
    runner_config: RunnerConfig
    vm_config: VMConfig
    github_token: str
    max_job_count: int = 1  # Ephemeral: 1 job per runner
    rotation_interval_hours: int = 24

@dataclass
class EphemeralRunner:
    """Details of an ephemeral runner."""
    vm_details: VMDetails
    runner_id: int
    runner_name: str
    created_at: datetime
    jobs_completed: int
    status: Literal["provisioning", "registered", "active", "draining", "destroyed"]
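
The spec does not say how `rotation_interval_hours` is evaluated. One plausible reading, sketched here with a hypothetical helper, is that a runner becomes due for rotation once its age (from `created_at`) reaches the interval. Timestamps are assumed to be timezone-aware.

```python
from __future__ import annotations

from datetime import datetime, timedelta, timezone


def rotation_due(
    created_at: datetime,
    rotation_interval_hours: int = 24,
    now: datetime | None = None,
) -> bool:
    """Return True once the runner is at least rotation_interval_hours old."""
    now = now or datetime.now(timezone.utc)
    return now - created_at >= timedelta(hours=rotation_interval_hours)
```

Passing `now` explicitly keeps the check deterministic in unit tests.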

Public Interface¶

class GitHubRunnerLifecycleManager:
    """Manage complete runner lifecycle."""

    @classmethod
    def provision_ephemeral_runner(
        cls,
        config: RunnerLifecycleConfig
    ) -> EphemeralRunner:
        """Provision new ephemeral runner.

        Steps:
        1. Provision VM using VMProvisioner
        2. Get registration token
        3. Register runner on VM as ephemeral (--ephemeral flag to ./config.sh)
        4. Start runner service
        """

    @classmethod
    def destroy_runner(
        cls,
        runner: EphemeralRunner,
        config: RunnerLifecycleConfig
    ) -> None:
        """Destroy ephemeral runner.

        Steps:
        1. Stop runner service on VM
        2. Deregister from GitHub
        3. Delete VM
        """

    @classmethod
    def rotate_runner(
        cls,
        old_runner: EphemeralRunner,
        config: RunnerLifecycleConfig
    ) -> EphemeralRunner:
        """Rotate runner for security.

        Steps:
        1. Provision new runner
        2. Wait for new runner to be ready
        3. Destroy old runner
        """

    @classmethod
    def check_runner_health(
        cls,
        runner: EphemeralRunner,
        github_token: str
    ) -> bool:
        """Check if runner is healthy."""

Module 5: CLI Integration¶

File: src/azlin/commands/github_runner.py

Purpose: Provide CLI commands for runner management

Commands¶

# Enable runner fleet on a pool
azlin github-runner enable \
    --repo owner/repo \
    --pool ci-workers \
    --labels "linux,docker" \
    --min-runners 0 \
    --max-runners 10

# Disable runner fleet
azlin github-runner disable --pool ci-workers

# Show runner fleet status
azlin github-runner status --pool ci-workers

# Scale runner fleet manually
azlin github-runner scale --pool ci-workers --count 5

Configuration Storage¶

Configuration is stored in the azlin config file:

{
  "github_runner_fleets": {
    "ci-workers": {
      "repo_owner": "myorg",
      "repo_name": "myrepo",
      "labels": ["linux", "docker"],
      "min_runners": 0,
      "max_runners": 10,
      "enabled": true
    }
  }
}
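
A minimal sketch of reading this structure back, assuming the keys shown above are the complete set for each fleet. `FleetConfig` and `load_fleets` are hypothetical names, and the bounds check is an added assumption rather than a rule stated in the spec. Note that no token field appears in the stored data.

```python
import json
from dataclasses import dataclass


@dataclass
class FleetConfig:
    """One entry under "github_runner_fleets" (hypothetical model)."""
    repo_owner: str
    repo_name: str
    labels: list[str]
    min_runners: int
    max_runners: int
    enabled: bool


def load_fleets(config_text: str) -> dict[str, FleetConfig]:
    """Parse the config JSON into one FleetConfig per pool name."""
    raw = json.loads(config_text)
    fleets: dict[str, FleetConfig] = {}
    for pool_name, entry in raw.get("github_runner_fleets", {}).items():
        fleet = FleetConfig(**entry)  # unknown or missing keys raise TypeError
        if not 0 <= fleet.min_runners <= fleet.max_runners:
            raise ValueError(
                f"{pool_name}: min_runners must be between 0 and max_runners"
            )
        fleets[pool_name] = fleet
    return fleets
```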

Testing Strategy¶

Unit Tests¶

Each module has comprehensive unit tests:

RunnerProvisioner: Mock GitHub API responses
JobQueueMonitor: Mock API responses with various queue states
AutoScaler: Test scaling logic with different scenarios
RunnerLifecycleManager: Mock VM provisioning and API calls

Integration Tests¶

Test end-to-end flows with mocked external services:

Provision ephemeral runner flow
Auto-scaling up flow
Auto-scaling down flow
Runner rotation flow

Manual Testing¶

CLI commands tested manually:

Enable fleet on test pool
Verify runners register with GitHub
Trigger workflow and verify job execution
Verify auto-scaling behavior
Disable fleet and verify cleanup

Security Considerations¶

Token Management¶

GitHub token from environment variable GITHUB_TOKEN
Never store tokens in config files
Tokens validated before use
Registration tokens used immediately and discarded
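
These rules might look like the following in code. The spec does not define what token validation involves, so this sketch only checks that `GITHUB_TOKEN` is present and non-empty; `MissingTokenError`, `read_github_token`, and `redact` are hypothetical names.

```python
from __future__ import annotations

import os
from collections.abc import Mapping


class MissingTokenError(RuntimeError):
    """Raised when GITHUB_TOKEN is absent or empty (hypothetical error type)."""


def read_github_token(environ: Mapping[str, str] | None = None) -> str:
    """Read the token from the environment; it is never written to disk."""
    env = os.environ if environ is None else environ
    token = env.get("GITHUB_TOKEN", "").strip()
    if not token:
        raise MissingTokenError("GITHUB_TOKEN is not set")
    return token


def redact(message: str, token: str) -> str:
    """Remove the token from a message before it is logged or raised."""
    return message.replace(token, "***") if token else message
```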

Input Validation¶

Repository names: ^[a-zA-Z0-9_-]+/[a-zA-Z0-9._-]+$
Pool names: ^[a-zA-Z0-9_-]+$
Labels: ^[a-zA-Z0-9._-]+$
Runner names: generated as unique identifiers, not taken from user input
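
The patterns above can be applied directly with Python's `re` module. The helper names below are hypothetical, and the comma-separated parsing of `--labels` is an assumption based on the CLI example.

```python
import re

REPO_PATTERN = re.compile(r"^[a-zA-Z0-9_-]+/[a-zA-Z0-9._-]+$")
POOL_PATTERN = re.compile(r"^[a-zA-Z0-9_-]+$")
LABEL_PATTERN = re.compile(r"^[a-zA-Z0-9._-]+$")


def parse_repo(value: str) -> tuple[str, str]:
    """Validate "owner/repo" and split it into its two parts."""
    # fullmatch also rejects a trailing newline, which "$" alone would allow.
    if not REPO_PATTERN.fullmatch(value):
        raise ValueError(f"invalid repository: {value!r}")
    owner, name = value.split("/", 1)
    return owner, name


def validate_pool(value: str) -> str:
    """Return the pool name unchanged if it is valid."""
    if not POOL_PATTERN.fullmatch(value):
        raise ValueError(f"invalid pool name: {value!r}")
    return value


def parse_labels(value: str) -> list[str]:
    """Split a comma-separated label string and validate each label."""
    labels = [label.strip() for label in value.split(",") if label.strip()]
    for label in labels:
        if not LABEL_PATTERN.fullmatch(label):
            raise ValueError(f"invalid label: {label!r}")
    return labels
```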

API Security¶

HTTPS only
Timeout on all requests (30 seconds)
Rate limit handling (exponential backoff)
Error messages sanitized (no token exposure)
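
The spec names exponential backoff for rate limits but gives no parameters. The sketch below assumes a 1-second base, a 60-second cap, five retries, and optional full jitter; all of these values and both function names are illustrative only.

```python
from __future__ import annotations

import random
import time
from collections.abc import Callable, Iterator


def backoff_delays(
    max_retries: int = 5,
    base_seconds: float = 1.0,
    cap_seconds: float = 60.0,
    jitter: bool = True,
    rng: random.Random | None = None,
) -> Iterator[float]:
    """Yield the wait before each retry: base * 2**attempt, capped."""
    rng = rng or random.Random()
    for attempt in range(max_retries):
        delay = min(cap_seconds, base_seconds * 2 ** attempt)
        # Full jitter spreads retries out so clients do not retry in lockstep.
        yield rng.uniform(0, delay) if jitter else delay


def call_with_backoff(
    operation: Callable[[], object],
    should_retry: Callable[[object], bool],
    sleep: Callable[[float], None] = time.sleep,
    **backoff_options,
) -> object:
    """Call operation until should_retry(result) is False or retries run out."""
    for delay in backoff_delays(**backoff_options):
        result = operation()
        if not should_retry(result):
            return result
        sleep(delay)
    return operation()  # final attempt; the caller inspects the result
```

Here `should_retry` would typically test for a rate-limit response; injecting `sleep` keeps the helper fast to unit test.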

Performance Considerations¶

Polling Intervals¶

Queue monitoring: Every 60 seconds
Runner health checks: Every 120 seconds
Scaling decisions: After each queue check

Concurrency¶

Use existing BatchExecutor for parallel operations
Maximum 10 concurrent provisioning operations
Graceful degradation on API rate limits
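
The concurrency cap could be enforced as sketched below. azlin's `BatchExecutor` interface is not described in this document, so the standard library's `ThreadPoolExecutor` stands in for it here; `provision_many` and `provision_one` are hypothetical names.

```python
from collections.abc import Callable
from concurrent.futures import ThreadPoolExecutor

MAX_CONCURRENT_PROVISIONING = 10  # limit stated in the spec


def provision_many(
    runner_names: list[str],
    provision_one: Callable[[str], object],
) -> list[object]:
    """Provision runners in parallel, at most 10 at a time, preserving order."""
    if not runner_names:
        return []
    workers = min(MAX_CONCURRENT_PROVISIONING, len(runner_names))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in input order and re-raises the first failure.
        return list(pool.map(provision_one, runner_names))
```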

Dependencies¶

External¶

requests: HTTP client for GitHub API
Existing azlin modules (internal): vm_provisioning, ssh_connector, config_manager

Python Version¶

Python 3.11+ (existing azlin requirement)

Acceptance Criteria¶

✓ Register VMs as GitHub Actions runners
✓ Auto-scaling based on job queue depth
✓ Ephemeral runners (per-job lifecycle)
✓ Runner rotation for security
✓ CLI command azlin github-runner enable --repo owner/repo --pool pool-name
✓ No credential storage (token from environment)
✓ Comprehensive test coverage (>80%)
✓ Philosophy compliance (simplicity, modularity, zero-BS)