Independent HPC and Linux infrastructure consulting

HPC and storage systems, reviewed by an operator.

Blue Ridge Compute provides independent HPC and Linux infrastructure consulting for research computing teams, data-intensive organizations, Slurm schedulers, Lustre/ZFS storage, backup/recovery readiness, performance triage, procurement review, and operations planning.

Focused senior technical review for universities, labs, small technical organizations, and data-intensive teams that need field-credible infrastructure judgment without a permanent senior hire.

About

Senior HPC judgment for teams that live with the consequences.

Vendors know their products. Internal teams know their users. Blue Ridge Compute helps bridge the gap with independent technical review from someone who has operated production research infrastructure and lived with real failure modes.

Blue Ridge Compute is an independent HPC and Linux infrastructure consultancy focused on research computing, large shared storage, scheduler operations, backup/recovery readiness, monitoring, and operational documentation.

The work is grounded in years of production experience operating HPC and multi-petabyte storage infrastructure for a major scientific research environment. That background shapes every engagement: practical, risk-aware, and focused on what your team can actually maintain.

Whether you are planning a storage refresh, troubleshooting performance complaints, reviewing backup and recovery practices, migrating platforms, or documenting an inherited cluster, the goal is the same: help you understand what is fragile, what is risky, and what to fix first.

5.5 PB+ Experience with production Lustre environments at petabyte scale.
200+ Researchers supported across scientific computing environments.
6+ yrs Production HPC infrastructure engineering and operations.
PhD Scientific background that helps connect infrastructure to research impact.

Services

Focused work where infrastructure experience matters.

Blue Ridge Compute is not a general IT shop. Engagements are designed around HPC systems, Linux infrastructure, large-scale storage, backup/recovery readiness, research computing operations, and the planning decisions that shape them.

Common engagements include HPC consulting, Slurm configuration review, Lustre performance troubleshooting, ZFS storage reliability review, Linux infrastructure health checks, backup and recovery readiness review, Linux cluster operations assessment, HPC procurement review, and research computing infrastructure documentation.

Storage

Lustre consulting for performance and reliability

Performance triage, workload and I/O-pattern review, storage health assessment, failure trend review, and practical recommendations for production parallel I/O environments.

Compute

Linux HPC and Slurm operations review

Scheduler policy review, Slurm configuration guidance, RHEL lifecycle planning, compute-node operations review, and implementation support for clearly scoped projects.

Architecture

Storage and compute design review

Independent assessment of proposed architectures, vendor quotes, topology decisions, network assumptions, capacity planning, support assumptions, and long-term operational burden.

Migration

Platform migration planning

Planning and risk review for scheduler migrations, RHEL 7 to 8/9 transitions, authentication modernization, storage moves, and research workflow continuity.

Operations

Monitoring, automation, and runbooks

Alert review, operational checklists, recurring task automation, handoff documentation, and procedures that reduce single-person knowledge risk.

Assessment

Research computing health checks

A structured review of compute and storage environments with severity-ranked findings, quick wins, medium-term risks, and practical remediation paths.

Linux infrastructure

Infrastructure health checks for technical teams

Fixed-scope review of Linux servers, patching practices, monitoring coverage, access patterns, operational documentation, and risks that small technical teams may not have time to assess internally.

Resilience

Backup and recovery readiness review

Review of backup coverage, restore assumptions, retention practices, offsite copies, monitoring signals, and recovery procedures so teams know what would actually work after a failure.

Problems

Sound familiar?

Most teams do not need a permanent consultant. They need a grounded review of what is risky, what is fragile, and what deserves attention before the next outage, purchase, or migration.

The cluster works, but no one is sure it is configured correctly.

Users complain about performance, but the bottleneck is unclear.

Storage, backups, and recovery assumptions are growing faster than monitoring, testing, or replacement procedures.

Linux infrastructure is business-critical, but patching, access, monitoring, and recovery practices have grown organically.

You are reviewing a vendor proposal and want a second set of technical eyes.

Operational documentation exists mostly in one person’s head.

You inherited an HPC, Linux, or storage environment and need to understand what you actually have.

Engagement model

Remote-first, written, and scoped around decisions.

Engagements are built for senior technical review, planning, troubleshooting, and documentation. The default output is not an endless meeting cycle — it is a clear written assessment, decision memo, or remediation plan your team can use.

Discover

Short technical intake

We define the system, problem, timeline, constraints, and what a useful outcome looks like.

Review

Config, architecture, and operations

Work may include configuration review, diagrams, vendor quotes, monitoring screenshots, logs, procedures, or scheduled screen-share sessions.

Document

Findings and priorities

You receive written findings with risks, tradeoffs, remediation options, and next steps ranked by importance.

Support

Follow-through by agreement

Optional follow-up can include implementation guidance, additional review, or scheduled troubleshooting during agreed support windows.

Fit

Designed for focused research computing and Linux infrastructure work.

Blue Ridge Compute is best suited for teams with real HPC, Linux infrastructure, storage, scheduler, backup/recovery, or research computing decisions to make — especially when an independent senior technical review would reduce risk.

Good fit

  • University and lab research computing teams
  • Small HPC groups without a dedicated senior HPC engineer
  • Teams using or evaluating Lustre, Slurm, Linux clusters, InfiniBand, or large shared storage
  • Organizations planning refreshes, migrations, backup/recovery improvements, monitoring improvements, or operational handoffs
  • Small technical teams that need senior Linux infrastructure review without ongoing helpdesk or staff-augmentation scope

Where we add the most value

  • Independent review before a major compute, storage, or network purchase
  • Performance, reliability, or post-incident questions where the bottleneck or failure mode is unclear
  • Inherited infrastructure that needs documentation, risk assessment, or operational cleanup
  • Research workflows where infrastructure decisions affect scientific productivity

Availability and scope

Most work is remote-first and scheduled. Limited hands-on implementation may be available for clearly scoped projects. Emergency response, production access, recurring support, and after-hours availability require a separate agreement with explicit expectations.

Packages

Fixed-scope engagements that are easier to approve.

Starting prices are intended for bounded remote reviews with clearly defined inputs and deliverables. Final pricing depends on environment size, access requirements, number of systems reviewed, urgency, meetings, vendor involvement, and implementation scope.

Assessment

Research Computing Health Check

Starting at $3,500

Structured review of Linux HPC, Slurm, shared compute, monitoring, documentation, and operational risk.

Typical deliverables
  • Severity-ranked findings
  • Quick wins and medium-term recommendations
  • Follow-up review call

Infrastructure

Linux Infrastructure Health Check

Starting at $3,500

Fixed-scope review of Linux servers, monitoring, patching, access patterns, documentation, recurring operations, and operational risk for small technical organizations.

Typical deliverables
  • Severity-ranked infrastructure findings
  • Operational quick wins and risk-reduction steps
  • Follow-up review call

Resilience

Backup & Recovery Readiness Review

Starting at $4,000

Review of backup coverage, retention, offsite copies, restore testing, monitoring, documentation, and recovery assumptions before a failed restore becomes the first real test.

Typical deliverables
  • Backup and restore-risk summary
  • Coverage, retention, and testing gaps
  • Practical recovery-readiness recommendations

Storage

Lustre/ZFS Storage Consulting & Reliability Assessment

Starting at $4,500

Lustre, ZFS, and Linux storage review focused on reliability, drive health, monitoring, backup assumptions, replacement workflows, and operational gaps.

Typical deliverables
  • Reliability and failure-mode summary
  • Monitoring and replacement planning gaps
  • Prioritized remediation plan

Performance

Lustre Performance Consulting & I/O Triage

Starting at $5,000 limited triage

Review of observed performance issues, workload patterns, available metrics, configuration, striping practices, and test results. Expanded benchmarking or production tuning requires separate scope.

Typical deliverables
  • Observation and evidence summary
  • Likely bottlenecks and recommended next tests
  • Tuning options with implementation guidance

Compute

Slurm Scheduler & Compute Operations Review

Starting at $3,500

Review of Slurm policy, partitions, limits, priorities, node state patterns, cgroups, job failure signals, and operational practices for Linux HPC clusters.

Typical deliverables
  • Scheduler policy and configuration findings
  • Compute-node operations observations
  • Practical cleanup and risk-reduction steps

Stability

Post-Incident Stability Review

Starting at $5,000

Non-emergency review of outages, recurring instability, unusual failures, or unclear recovery events after service has been restored.

Typical deliverables
  • Incident timeline and evidence summary
  • Likely contributing factors
  • Prevention, monitoring, and runbook recommendations

Operations

Runbook & Monitoring Add-On

Starting at $2,500

Operational checklist creation, alert triage review, recurring task documentation, and handoff procedures, usually paired with an assessment, incident review, or reliability engagement.

Typical deliverables
  • Runbook outline or completed procedures
  • Monitoring gap analysis
  • Admin handoff documentation

Advisory

Advisory Retainer

Starting at $2,500 / month

Recurring senior technical review for teams that need periodic architecture feedback, second opinions, or scheduled troubleshooting. Base retainers include a defined monthly hour limit and scheduled support only; emergency response is not included unless separately contracted.

Typical deliverables
  • Recurring advisory calls
  • Async technical review
  • Defined monthly hour limit
  • Scheduled support windows by agreement

Who we serve

Research and data-intensive teams with real infrastructure stakes.

The best engagements are with teams that already understand the value of reliable compute, storage, backups, and operations, but need targeted senior review, planning, or documentation help.

Research universities

Support for campus and departmental computing

Senior HPC perspective for research computing teams that need focused expertise without adding a permanent hire.

Public research

Defined infrastructure review and planning

Support for public-sector and federally funded research teams with scoped HPC infrastructure, storage, or documentation needs.

Enterprise HPC

Biotech, genomics, and data-intensive teams

Infrastructure guidance for organizations scaling Linux compute, shared storage, scheduler policy, and data-intensive workflows, including environments preparing for GPU or AI workloads.

Small technical teams

Linux infrastructure review without helpdesk scope

Fixed-scope assessment for SaaS, engineering, and data-intensive teams that need senior review of Linux operations, backups, monitoring, and recovery readiness.

Inherited systems

Clarity when ownership changes

Assessment and documentation for teams that have inherited systems, lost key staff, or need to reduce single-person knowledge risk.

Growing teams

Planning before scale exposes weak points

Review for teams expanding compute, storage, monitoring, or scheduler policy before small operational issues become structural problems.

Independent by design

No reseller agenda.

Blue Ridge Compute provides independent technical review. Recommendations are based on operational fit, maintainability, risk, and your team’s constraints — not hardware resale incentives.

Professional boundaries

Conflict-aware consulting

Engagements are accepted only where they do not conflict with existing employment, confidentiality obligations, or prior commitments. Production access, sensitive data, emergency support, and recurring availability must be explicitly scoped in advance.

Contact

Tell me what you are running and what is not working.

Useful first messages include the approximate size of the environment, current stack, the problem you are trying to solve, timeline, and whether you need review, troubleshooting, planning, or implementation support.

Prefer direct email?

[email protected]

Please do not include passwords, private keys, regulated data, proprietary configs, or sensitive system details in the initial inquiry. A secure exchange method can be arranged if needed.

Prefer email? Contact [email protected].