Infrastructure Automation with Bash and Python in Hybrid Environments

In a homogeneous Kubernetes-on-Linux stack, scripting feels like a legacy concern. In a real enterprise environment — mixed Linux distros, some Windows servers, a few air-gapped sites, vendor appliances with their own quirks — scripting is what holds the operational layer together. Configuration management tools (Ansible, Salt, Chef) handle the structured cases; everything else is bash and Python doing the work that makes the rest possible.

This post is about doing that scripting well: which language to reach for, how to write scripts that survive contact with production, and what discipline turns ad-hoc automation into operational infrastructure.

When to Use Bash and When to Use Python

The rough division of labor that’s held up for decades:

Bash is good at:

Orchestrating other commands (gluing CLIs together).
File and process operations.
Short scripts under ~100 lines.
Environments where Python isn’t available or convenient.
Quick one-shots that won’t be maintained.

Python is good at:

Anything involving structured data (JSON, YAML, CSV).
HTTP APIs, cloud SDKs, retry/error handling.
Cross-platform scripts (Linux + Windows).
Anything that will be tested, versioned, and maintained.
Scripts longer than ~100 lines.

The decision rule that holds: if the script reads or writes structured data, use Python. If it pipes between command-line tools, use bash. The default for new automation in 2026 should be Python — the maintenance cost is lower, the testing story is better, and the cross-platform compatibility is built in.

Bash That Doesn’t Surprise You

The default behavior of bash is dangerous: unset variables expand to empty strings, failed commands continue silently, pipelines hide errors. The first three lines of any non-trivial bash script:

#!/usr/bin/env bash
set -euo pipefail
IFS=$'\n\t'

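-e exits when a command fails, with some exceptions: failures inside if/while conditions or on the left side of && and || do not trigger it.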
-u errors on unset variables.
-o pipefail makes pipelines fail if any stage fails (default: only the last stage matters).
IFS=$'\n\t' prevents word-splitting on spaces, which catches the canonical “filename with space” bug.

These three lines (four settings) eliminate the most common bash bugs. Use them by default.

A few more patterns worth internalizing:

# Quote all variables. Always.
rm -rf "${TMP_DIR:?TMP_DIR not set}"

# Use [[ ]] not [ ] for conditionals — handles spaces correctly.
if [[ -f "$file" ]]; then ...

# Check command success explicitly.
if ! systemctl restart nginx; then
  echo "nginx restart failed" >&2
  exit 1
fi

# Capture output and exit code.
output=$(some_command 2>&1) || {
  echo "command failed: $output" >&2
  exit 1
}

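# Trap cleanup. Use a private variable name: TMPDIR is an environment
# variable that mktemp and other tools read, so don't overwrite it.
work_dir=$(mktemp -d)
trap 'rm -rf "$work_dir"' EXIT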

# Use functions; pass arguments explicitly.
deploy_service() {
  local service="$1"
  local version="$2"
  ...
}

For anything longer than 50 lines, write functions, return exit codes explicitly, and check them. Bash doesn’t do well with error handling, but disciplined explicit checks turn it from “unpredictable” to “predictable when given the wrong input.”

ShellCheck Is Non-Negotiable

shellcheck catches the vast majority of common bash bugs statically. Run it in CI; treat its warnings as errors.

# CI step
shellcheck scripts/*.sh

Use shfmt for formatting. Together the two tools do for your scripts what the scripts do for your systems: automation around the automation.

Python for Operations

Every current mainstream Linux distro ships Python 3 (3.9 on RHEL 9, 3.11 on Debian 12, 3.12 on Ubuntu 24.04), and the standard library covers most operational work without external dependencies. A working skeleton that runs on any of them:

#!/usr/bin/env python3
"""Restart unhealthy nodes and report results."""

from __future__ import annotations

import argparse
import json
import logging
import subprocess
import sys
from dataclasses import dataclass
from pathlib import Path

logger = logging.getLogger(__name__)


@dataclass
class Node:
    name: str
    status: str


def list_nodes(kubeconfig: Path) -> list[Node]:
    result = subprocess.run(
        ["kubectl", "--kubeconfig", str(kubeconfig),
         "get", "nodes", "-o", "json"],
        capture_output=True, check=True, text=True, timeout=30,
    )
    data = json.loads(result.stdout)
    return [
        Node(name=n["metadata"]["name"],
             status=_ready_condition(n).get("status", "Unknown"))
        for n in data["items"]
    ]


def _ready_condition(node: dict) -> dict:
    for c in node["status"].get("conditions", []):
        if c["type"] == "Ready":
            return c
    return {}


def main() -> int:
    parser = argparse.ArgumentParser()
    parser.add_argument("--kubeconfig", type=Path, required=True)
    parser.add_argument("--dry-run", action="store_true")
    parser.add_argument("-v", "--verbose", action="store_true")
    args = parser.parse_args()

    logging.basicConfig(
        level=logging.DEBUG if args.verbose else logging.INFO,
        format="%(asctime)s %(levelname)s %(message)s",
    )

    try:
        nodes = list_nodes(args.kubeconfig)
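    except subprocess.CalledProcessError as e:
        logger.error("kubectl failed: %s", e.stderr)
        return 1
    except subprocess.TimeoutExpired:
        # list_nodes sets timeout=30, so handle the timeout as its own failure mode.
        logger.error("kubectl timed out")
        return 2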

    unhealthy = [n for n in nodes if n.status != "True"]
    if not unhealthy:
        logger.info("all nodes healthy")
        return 0

    for node in unhealthy:
        logger.warning("node %s is %s", node.name, node.status)
        if not args.dry_run:
            ...
    return 0


if __name__ == "__main__":
    sys.exit(main())

A few production patterns in evidence:

argparse from the start. Even if it has one argument now.
Logging, not print. Levels, timestamps, configurable verbosity.
subprocess.run with explicit check, timeout, text. Never bare subprocess.call or shell-string commands.
Exit codes from main. 0 for success, non-zero for distinct failure modes.
Type hints throughout. The script is now statically checkable.
--dry-run for anything that mutates. Every destructive script should have one.
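
The --dry-run item can be sketched as a single helper that every state-changing command passes through. This is a minimal sketch, not part of the skeleton above: the name run_mutating and its True/False return convention are assumptions made for illustration.

```python
import logging
import subprocess

logger = logging.getLogger(__name__)


def run_mutating(cmd: list[str], *, dry_run: bool, timeout: int = 60) -> bool:
    """Run a state-changing command, or only log it when dry_run is set.

    Returns True if the command succeeded (or would have been run).
    """
    if dry_run:
        logger.info("dry-run: would run %s", " ".join(cmd))
        return True
    try:
        subprocess.run(
            cmd, check=True, capture_output=True, text=True, timeout=timeout
        )
    except (subprocess.CalledProcessError, subprocess.TimeoutExpired, OSError) as exc:
        # OSError covers a missing or non-executable binary.
        logger.error("command failed: %s (%s)", " ".join(cmd), exc)
        return False
    return True
```

Routing every mutation through one function means the dry-run path and the real path share argument construction and logging, so a dry run exercises most of the code that a real run would.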

Cross-Platform Concerns

A script that runs on Linux and Windows needs deliberate compatibility work. The patterns:

Use pathlib.Path, not string paths. Handles / vs. \ correctly.
Don’t shell out to commands that differ. os.cpu_count() instead of nproc; shutil.disk_usage instead of df.
Quote paths via shlex.quote (Linux) or subprocess.list2cmdline (Windows). Or pass arguments as a list and let subprocess handle it.
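Newlines. \n vs. \r\n. Python’s text mode translates line endings by default; pass newline="" to open() when they must be preserved exactly (the csv module requires it).
Environment variables. Names and conventions differ: USERPROFILE vs. HOME, TEMP vs. TMPDIR, and the PATH separator (; vs. :). Prefer Path.home() and tempfile.gettempdir() over reading the variables directly.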
Encoding. Set encoding="utf-8" everywhere; don’t depend on system defaults.

For genuinely cross-platform operations, libraries like psutil (system info), paramiko / fabric (SSH), pywin32 (Windows-specific), and pywinrm (Windows remote management) cover the platform-specific surfaces.
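
Several of the patterns above can be sketched with the standard library alone. The function names host_summary and write_report are hypothetical, chosen only for this example.

```python
import os
import shutil
import tempfile
from pathlib import Path


def host_summary(data_dir: Path) -> dict:
    """Collect basic host facts without shelling out to nproc or df."""
    usage = shutil.disk_usage(data_dir)
    return {
        "cpus": os.cpu_count(),          # may be None if undeterminable
        "disk_free_bytes": usage.free,
        "temp": tempfile.gettempdir(),   # honours TMPDIR, TEMP, and TMP
    }


def write_report(path: Path, lines: list[str]) -> None:
    """Write text with an explicit encoding and LF line endings on every OS."""
    # newline="\n" stops Windows from translating LF to CRLF on write.
    with path.open("w", encoding="utf-8", newline="\n") as fh:
        for line in lines:
            fh.write(line + "\n")
```

The same two functions run unchanged on Linux and Windows because nothing in them depends on a path separator, an external command, or the system default encoding.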

Idempotency Is the Whole Game

A production automation script must be safe to run twice. The single most common operational bug is a script that does the right thing the first time and the wrong thing on retry.

Patterns:

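Check before creating. if not file.exists(): file.touch(). Better still, prefer calls that are idempotent on their own, such as Path.mkdir(parents=True, exist_ok=True), which avoid the gap between the check and the action.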
Use mkdir -p not mkdir.
Use rm -f not rm.
State files for tracking progress. A long-running migration writes “step 7 complete” and resumes from step 8 on retry.
API-level idempotency. AWS SDKs accept idempotency tokens for many calls. Use them.

A script that says “I detected the desired state already exists; nothing to do” is correct; one that succeeds the first time and errors the second time is broken.
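
The state-file pattern might look like the following minimal sketch. The run_steps name, the JSON layout of the state file, and the use of step names as keys are all assumptions for illustration; each step is assumed to be safe to re-run if it fails partway through.

```python
import json
import os
from pathlib import Path
from typing import Callable


def run_steps(steps: list[tuple[str, Callable[[], None]]], state_file: Path) -> list[str]:
    """Run named steps in order, skipping any already recorded as done.

    Returns the names of the steps executed in this invocation.
    """
    done = []
    if state_file.exists():
        done = json.loads(state_file.read_text(encoding="utf-8"))["done"]

    executed = []
    for name, action in steps:
        if name in done:
            continue  # completed on a previous run
        action()  # an exception here leaves the state file at the last good step
        done.append(name)
        # Write to a temp file, then rename, so a crash never leaves
        # a half-written state file behind.
        tmp = state_file.with_suffix(".tmp")
        tmp.write_text(json.dumps({"done": done}), encoding="utf-8")
        os.replace(tmp, state_file)
        executed.append(name)
    return executed
```

A second invocation after a failure skips everything recorded as done and picks up at the first unfinished step; an invocation after full success does nothing.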

Secrets and Credentials

Hard-coded secrets in scripts are the original sin of automation. The hierarchy of correctness:

Best: short-lived credentials from a secrets manager. AWS Secrets Manager, Vault, SSM Parameter Store. Fetched at runtime, never persisted.
Acceptable: encrypted secrets injected at runtime. Environment variables from a secrets store, sealed-secrets, SOPS-encrypted files decrypted on use.
Discouraged: plaintext in environment variables or files. Better than committed in source, but still exposed.
Forbidden: secrets in source control. Even in private repos.

For cloud-resident automation, IAM Instance Roles / Pod Identity / Managed Identity are the right answer — the script never sees a credential; the runtime resolves identity from the surrounding environment.

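import boto3

# No access keys in the script: boto3 resolves credentials from the
# instance role or pod identity at runtime.
ssm = boto3.client("ssm")
secret = ssm.get_parameter(
    Name="/myorg/prod/api-token",
    WithDecryption=True,  # decrypt SecureString values
)["Parameter"]["Value"]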

Never log secrets, even at DEBUG. A common bug: a structured-logger context includes a “config” object that includes a secret field.
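
One way to guard against the “config object in the log context” bug is a logging filter that masks known-sensitive keys before any handler formats the record. This is a sketch under assumptions: the key list in SENSITIVE_KEYS is illustrative, and the filter only inspects dicts, lists, and tuples passed as log arguments, not secrets already interpolated into the message string.

```python
import logging

SENSITIVE_KEYS = {"password", "token", "secret", "api_key"}  # illustrative list


def redact(value):
    """Return a copy of value with sensitive dict fields masked, recursively."""
    if isinstance(value, dict):
        return {
            k: "***" if str(k).lower() in SENSITIVE_KEYS else redact(v)
            for k, v in value.items()
        }
    if isinstance(value, list):
        return [redact(v) for v in value]
    if isinstance(value, tuple):
        return tuple(redact(v) for v in value)
    return value


class RedactingFilter(logging.Filter):
    """Mask sensitive fields in log arguments before formatting."""

    def filter(self, record: logging.LogRecord) -> bool:
        # logging stores a single dict argument directly in record.args,
        # and multiple arguments as a tuple; redact() handles both.
        if record.args:
            record.args = redact(record.args)
        return True
```

Attach the filter to handlers rather than to a single logger (handler.addFilter(RedactingFilter())), since logger-level filters are not applied to records propagated up from child loggers.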

Logging and Output

Operational scripts produce two kinds of output: human-readable progress and machine-readable results. Don’t mix them.

stderr for progress, debug, warnings, errors.
stdout for the result (JSON for machine consumption).
A log file for full history.

This lets scripts compose: script1 | script2 works because stdout is structured; script1 2>&1 | tee logfile captures everything for an operator.

For multi-step operations, structured JSON logs to stderr (parsable by log aggregation) plus a human-readable summary at the end is the pattern that ages best.
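
The stderr/stdout split can be sketched in a few lines. The helper names configure_logging and emit_result are hypothetical; the out parameter exists only so the function can be tested without capturing the real stdout.

```python
import json
import logging
import sys


def configure_logging(verbose: bool = False) -> None:
    """Send all log records to stderr so stdout stays reserved for the result."""
    logging.basicConfig(
        stream=sys.stderr,
        level=logging.DEBUG if verbose else logging.INFO,
        format="%(asctime)s %(levelname)s %(message)s",
    )


def emit_result(result: dict, out=None) -> None:
    """Write the machine-readable result as a single JSON line."""
    out = out if out is not None else sys.stdout
    json.dump(result, out, sort_keys=True)
    out.write("\n")
```

With this split, a downstream consumer such as jq can parse the script’s stdout directly, while progress messages stay visible to the operator on stderr.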

Configuration

Configuration goes in one of three places. When the same value is set in more than one, earlier entries override later ones:

CLI arguments for invocation-specific values.
Environment variables for environment-specific values (dev/staging/prod).
Config files for stable, structured config.

Don’t put configuration in the script itself. Don’t put secrets in configuration files. Use a library for config parsing (pydantic-settings is excellent for Python) that validates types and reports missing required values clearly.
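
A standard-library sketch of layering the three sources, with CLI arguments overriding environment variables, which in turn override a config file. Everything named here is hypothetical: the load_settings function, the MYTOOL_ variable prefix, the region and timeout settings, and the fallback values. A library such as pydantic-settings would replace most of this with declared, validated fields.

```python
import argparse
import json
from pathlib import Path
from typing import Optional


def load_settings(argv: list[str], env: dict[str, str],
                  config_path: Optional[Path] = None) -> dict:
    """Resolve settings with CLI > environment > config file > fallback."""
    settings = {"region": "us-east-1", "timeout": 30}  # hypothetical fallbacks

    if config_path is not None and config_path.exists():
        settings.update(json.loads(config_path.read_text(encoding="utf-8")))

    if "MYTOOL_REGION" in env:
        settings["region"] = env["MYTOOL_REGION"]
    if "MYTOOL_TIMEOUT" in env:
        settings["timeout"] = int(env["MYTOOL_TIMEOUT"])

    parser = argparse.ArgumentParser()
    parser.add_argument("--region")
    parser.add_argument("--timeout", type=int)
    args = parser.parse_args(argv)
    if args.region is not None:
        settings["region"] = args.region
    if args.timeout is not None:
        settings["timeout"] = args.timeout

    # Validate once, after all layers are merged, and fail loudly.
    if settings["timeout"] <= 0:
        raise ValueError("timeout must be positive")
    return settings
```

A script would typically call this as load_settings(sys.argv[1:], dict(os.environ), Path("config.json")); passing argv and env in explicitly keeps the function easy to test.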

Error Handling and Retries

Network calls, cloud APIs, and disk operations all fail transiently. A script that exits on first transient failure is operationally fragile.

Use a retry library (tenacity for Python) rather than handcrafted loops:

from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_random_exponential

@retry(
    stop=stop_after_attempt(5),
    # Random wait below an exponentially growing ceiling, capped at 30 s:
    # exponential backoff with jitter.
    wait=wait_random_exponential(multiplier=1, max=30),
    retry=retry_if_exception_type(ConnectionError),
    reraise=True,
)
def fetch_metadata():
    ...

Two patterns to apply:

Idempotent retry for safe operations (reads, idempotent writes).
At-most-once for non-idempotent operations — use an idempotency token, not blind retry.

Exponential backoff with jitter should be the default; without jitter, retries from multiple instances synchronize and create thundering herds. In tenacity, wait_exponential on its own adds no jitter; wait_random_exponential does.
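
To make the effect of jitter concrete, here is a sketch of the delay schedule that capped exponential backoff with full jitter produces. It illustrates the arithmetic only and is not a replacement for a retry library; the backoff_delays name and its parameters are assumptions for this example.

```python
import random


def backoff_delays(attempts: int, base: float = 1.0, cap: float = 30.0,
                   rng=None) -> list[float]:
    """Return the sleep before each retry, using full jitter.

    Retry n (0-based) sleeps a uniform random time between 0 and
    min(cap, base * 2**n) seconds.
    """
    rng = rng or random.Random()
    return [rng.uniform(0, min(cap, base * 2 ** n)) for n in range(attempts)]
```

Two instances that fail at the same moment draw different delays from the same growing window, so their retries spread out instead of arriving in lockstep.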

Testing Automation

Scripts that run in production deserve tests. Python makes this straightforward:

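import json
from unittest.mock import patch

# Hypothetical module name for the skeleton script shown earlier.
from restart_nodes import list_nodes


def test_list_nodes_parses_kubectl_output(tmp_path):
    # Minimal kubectl-shaped payload: one node with a Ready condition.
    fake_output = json.dumps({"items": [{
        "metadata": {"name": "node1"},
        "status": {"conditions": [{"type": "Ready", "status": "True"}]},
    }]})
    with patch("subprocess.run") as mock:
        mock.return_value.stdout = fake_output
        nodes = list_nodes(tmp_path / "kubeconfig")
    assert len(nodes) == 1
    assert nodes[0].name == "node1"
    assert nodes[0].status == "True"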

For bash, bats (Bash Automated Testing System) provides a test framework. Less commonly used, but appropriate for scripts that are hard to rewrite in Python.

For the most consequential scripts — production deploy automation, infrastructure provisioning — also add integration tests that exercise the script against a real-ish environment (LocalStack for AWS, k3d for Kubernetes, ephemeral test VMs).

Configuration Management Tools

For structured operations across many machines, scripts don’t scale. Configuration management tools — Ansible primarily, with Salt and Chef in some environments — take over.

When to reach for them:

Multiple machines doing the same configuration.
Declarative desired state that the tool reconciles.
Multi-step procedures with dependencies.
Reusable roles for common configurations.

Ansible’s strength is that it’s mostly Python under the hood; for the edge cases where the built-in modules don’t fit, dropping into Python via the script or command module is a smooth escape hatch. The honest reality: most enterprises run Ansible for the structured operations and Python/bash for everything else.

Hybrid Environments Specifically

Mixed Linux/Windows fleets have specific operational patterns worth knowing:

Ansible with WinRM can manage Windows targets from a Linux control node. The Windows modules are mature.
OpenSSH on Windows (built into modern Windows Server) lets you SSH to Windows targets and run PowerShell remotely. Simpler than WinRM for some workflows.
PowerShell Core is cross-platform; scripts written in PowerShell 7 can run on Linux. Reasonable choice when the team has stronger PowerShell skills.
Avoid running the same script on different OSes via OS-detection. Two clean per-OS implementations are easier to maintain than one cross-OS spaghetti.

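For air-gapped or restricted environments, package scripts and their dependencies for offline distribution. On a connected machine with the same OS, architecture, and Python version as the target, run pip download -d wheelhouse -r requirements.txt; ship the wheelhouse as a tarball; then install on the target with pip install --no-index --find-links wheelhouse -r requirements.txt.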

Closing

Operational scripting is one of those skills that determines whether an organization runs smoothly or chaotically, and it gets less attention than its impact warrants. The principles are durable: bash for short, OS-orchestrating glue with strict mode and ShellCheck; Python for anything structured, with type hints, logging, argparse, and tests; idempotency for everything that mutates; structured output for composability; secrets fetched at runtime never committed; retries with jitter and backoff for anything over the network; configuration management tools for fleet-scale structured operations. None of this is new, and that’s the point — these patterns work because they’ve been refined over decades by operators dealing with real failure modes. A script is infrastructure as much as a Terraform module is, and the discipline of writing it like infrastructure — versioned, tested, idempotent, observable — is what distinguishes automation that makes systems more reliable from automation that becomes the next outage.