Independent HPC and Linux infrastructure consulting

HPC and storage systems, reviewed by an operator.

Blue Ridge Compute provides independent HPC and Linux infrastructure consulting for research computing teams, data-intensive organizations, Slurm schedulers, Lustre/ZFS storage, backup/recovery readiness, performance triage, procurement review, and operations planning.

Start a conversation View fixed-scope options

Focused senior technical review for universities, labs, small technical organizations, and data-intensive teams that need field-credible infrastructure judgment without a permanent senior hire.

About

Senior HPC judgment for teams that live with the consequences.

Vendors know their products. Internal teams know their users. Blue Ridge Compute helps bridge the gap with independent technical review from someone who has operated production research infrastructure and lived with real failure modes.

Blue Ridge Compute is an independent HPC and Linux infrastructure consultancy focused on research computing, large shared storage, scheduler operations, backup/recovery readiness, monitoring, and operational documentation.

The work is grounded in years of production experience operating HPC and multi-petabyte storage infrastructure for a major scientific research environment. That background shapes every engagement: practical, risk-aware, and focused on what your team can actually maintain.

Whether you are planning a storage refresh, troubleshooting performance complaints, reviewing backup and recovery practices, migrating platforms, or documenting an inherited cluster, the goal is the same: help you understand what is fragile, what is risky, and what to fix first.

5.5 PB+ Experience with production Lustre environments at petabyte scale.

200+ Researchers supported across scientific computing environments.

6+ yrs Production HPC infrastructure engineering and operations.

PhD Scientific background that helps connect infrastructure to research impact.

Services

Focused work where infrastructure experience matters.

Blue Ridge Compute is not a general IT shop. Engagements are designed around HPC systems, Linux infrastructure, large-scale storage, backup/recovery readiness, research computing operations, and the planning decisions that shape them.

Common engagements include HPC consulting, Slurm configuration review, Lustre performance troubleshooting, ZFS storage reliability review, Linux infrastructure health checks, backup and recovery readiness review, Linux cluster operations assessment, HPC procurement review, and research computing infrastructure documentation.

Storage

Lustre consulting for performance and reliability

Performance triage, workload and I/O-pattern review, storage health assessment, failure trend review, and practical recommendations for production parallel I/O environments.

Compute

Linux HPC and Slurm operations review

Scheduler policy review, Slurm configuration guidance, RHEL lifecycle planning, compute-node operations review, and implementation support for clearly scoped projects.

Architecture

Storage and compute design review

Independent assessment of proposed architectures, vendor quotes, topology decisions, network assumptions, capacity planning, support assumptions, and long-term operational burden.

Migration

Platform migration planning

Planning and risk review for scheduler migrations, RHEL 7 to 8/9 transitions, authentication modernization, storage moves, and research workflow continuity.

Operations

Monitoring, automation, and runbooks

Alert review, operational checklists, recurring task automation, handoff documentation, and procedures that reduce single-person knowledge risk.

Assessment

Research computing health checks

A structured review of compute and storage environments with severity-ranked findings, quick wins, medium-term risks, and practical remediation paths.

Linux infrastructure

Infrastructure health checks for technical teams

Fixed-scope review of Linux servers, patching practices, monitoring coverage, access patterns, operational documentation, and risks that small technical teams may not have time to assess internally.

Resilience

Backup and recovery readiness review

Review of backup coverage, restore assumptions, retention practices, offsite copies, monitoring signals, and recovery procedures so teams know what would actually work after a failure.

Problems

Sound familiar?

Most teams do not need a permanent consultant. They need a grounded review of what is risky, what is fragile, and what deserves attention before the next outage, purchase, or migration.

The cluster works, but no one is sure it is configured correctly.

Users complain about performance, but the bottleneck is unclear.

Storage, backups, and recovery assumptions are growing faster than monitoring, testing, or replacement procedures.

Linux infrastructure is business-critical, but patching, access, monitoring, and recovery practices have grown organically.

You are reviewing a vendor proposal and want a second set of technical eyes.

Operational documentation exists mostly in one person’s head.

You inherited an HPC, Linux, or storage environment and need to understand what you actually have.

Engagement model

Remote-first, written, and scoped around decisions.

Engagements are built for senior technical review, planning, troubleshooting, and documentation. The default output is not an endless meeting cycle — it is a clear written assessment, decision memo, or remediation plan your team can use.

Discover

Short technical intake

We define the system, problem, timeline, constraints, and what a useful outcome looks like.

Review

Config, architecture, and operations

Work may include configuration review, diagrams, vendor quotes, monitoring screenshots, logs, procedures, or scheduled screen-share sessions.

Document

Findings and priorities

You receive written findings with risks, tradeoffs, remediation options, and next steps ranked by importance.

Support

Follow-through by agreement

Optional follow-up can include implementation guidance, additional review, or scheduled troubleshooting during agreed support windows.

Fit

Designed for focused research computing and Linux infrastructure work.

Blue Ridge Compute is best suited for teams with real HPC, Linux infrastructure, storage, scheduler, backup/recovery, or research computing decisions to make — especially when an independent senior technical review would reduce risk.

Good fit

University and lab research computing teams
Small HPC groups without a dedicated senior HPC engineer
Teams using or evaluating Lustre, Slurm, Linux clusters, InfiniBand, or large shared storage
Organizations planning refreshes, migrations, backup/recovery improvements, monitoring improvements, or operational handoffs
Small technical teams that need senior Linux infrastructure review without ongoing helpdesk or staff-augmentation scope

Where we add the most value

Independent review before a major compute, storage, or network purchase
Performance, reliability, or post-incident questions where the bottleneck or failure mode is unclear
Inherited infrastructure that needs documentation, risk assessment, or operational cleanup
Research workflows where infrastructure decisions affect scientific productivity

Availability and scope

Most work is remote-first and scheduled. Limited hands-on implementation may be available for clearly scoped projects. Emergency response, production access, recurring support, and after-hours availability require a separate agreement with explicit expectations.

Packages

Fixed-scope engagements that are easier to approve.

Starting prices are intended for bounded remote reviews with clearly defined inputs and deliverables. Final pricing depends on environment size, access requirements, number of systems reviewed, urgency, meetings, vendor involvement, and implementation scope.

Assessment

Research Computing Health Check

Starting at $3,500

Structured review of Linux HPC, Slurm, shared compute, monitoring, documentation, and operational risk.

Typical deliverables

Severity-ranked findings
Quick wins and medium-term recommendations
Follow-up review call

Infrastructure

Linux Infrastructure Health Check

Starting at $3,500

Fixed-scope review of Linux servers, monitoring, patching, access patterns, documentation, recurring operations, and operational risk for small technical organizations.

Typical deliverables

Severity-ranked infrastructure findings
Operational quick wins and risk-reduction steps
Follow-up review call

Resilience

Backup & Recovery Readiness Review

Starting at $4,000

Review of backup coverage, retention, offsite copies, restore testing, monitoring, documentation, and recovery assumptions before a failed restore becomes the first real test.

Typical deliverables

Backup and restore-risk summary
Coverage, retention, and testing gaps
Practical recovery-readiness recommendations

Procurement

Independent HPC Procurement & Architecture Review

Starting at $5,000

Second-opinion review of proposed compute, storage, network, support, and scheduler architecture before purchase or implementation. Multi-vendor comparisons or RFP support may require expanded scope.

Typical deliverables

Vendor quote and architecture risk review
Procurement questions and decision memo
Operational burden assessment

Storage

Lustre/ZFS Storage Consulting & Reliability Assessment

Starting at $4,500

Lustre, ZFS, and Linux storage review focused on reliability, drive health, monitoring, backup assumptions, replacement workflows, and operational gaps.

Typical deliverables

Reliability and failure-mode summary
Monitoring and replacement planning gaps
Prioritized remediation plan

Performance

Lustre Performance Consulting & I/O Triage

Starting at $5,000 limited triage

Review of observed performance issues, workload patterns, available metrics, configuration, striping practices, and test results. Expanded benchmarking or production tuning requires separate scope.

Typical deliverables

Observation and evidence summary
Likely bottlenecks and recommended next tests
Tuning options with implementation guidance

Compute

Slurm Scheduler & Compute Operations Review

Starting at $3,500

Review of Slurm policy, partitions, limits, priorities, node state patterns, cgroups, job failure signals, and operational practices for Linux HPC clusters.

Typical deliverables

Scheduler policy and configuration findings
Compute-node operations observations
Practical cleanup and risk-reduction steps

Stability

Post-Incident Stability Review

Starting at $5,000

Non-emergency review of outages, recurring instability, unusual failures, or unclear recovery events after service has been restored.

Typical deliverables

Incident timeline and evidence summary
Likely contributing factors
Prevention, monitoring, and runbook recommendations

Operations

Runbook & Monitoring Add-On

Starting at $2,500

Operational checklist creation, alert triage review, recurring task documentation, and handoff procedures, usually paired with an assessment, incident review, or reliability engagement.

Typical deliverables

Runbook outline or completed procedures
Monitoring gap analysis
Admin handoff documentation

Advisory

Advisory Retainer

Starting at $2,500 / month

Recurring senior technical review for teams that need periodic architecture feedback, second opinions, or scheduled troubleshooting. Base retainers include a defined monthly hour limit and scheduled support only; emergency response is not included unless separately contracted.

Typical deliverables

Recurring advisory calls
Async technical review
Defined monthly hour limit
Scheduled support windows by agreement

Who we serve

Research and data-intensive teams with real infrastructure stakes.

The best engagements are with teams that already understand the value of reliable compute, storage, backups, and operations, but need targeted senior review, planning, or documentation help.

Research universities

Support for campus and departmental computing

Senior HPC perspective for research computing teams that need focused expertise without adding a permanent hire.

Public research

Defined infrastructure review and planning

Support for public-sector and federally funded research teams with scoped HPC infrastructure, storage, or documentation needs.

Enterprise HPC

Biotech, genomics, and data-intensive teams

Infrastructure guidance for organizations scaling Linux compute, shared storage, scheduler policy, and data-intensive workflows, including environments preparing for GPU or AI workloads.

Small technical teams

Linux infrastructure review without helpdesk scope

Fixed-scope assessment for SaaS, engineering, and data-intensive teams that need senior review of Linux operations, backups, monitoring, and recovery readiness.

Inherited systems

Clarity when ownership changes

Assessment and documentation for teams that have inherited systems, lost key staff, or need to reduce single-person knowledge risk.

Growing teams

Planning before scale exposes weak points

Review for teams expanding compute, storage, monitoring, or scheduler policy before small operational issues become structural problems.

Independent by design

No reseller agenda.

Blue Ridge Compute provides independent technical review. Recommendations are based on operational fit, maintainability, risk, and your team’s constraints — not hardware resale incentives.

Professional boundaries

Conflict-aware consulting

Engagements are accepted only where they do not conflict with existing employment, confidentiality obligations, or prior commitments. Production access, sensitive data, emergency support, and recurring availability must be explicitly scoped in advance.

Contact

Tell me what you are running and what is not working.

Useful first messages include the approximate size of the environment, current stack, the problem you are trying to solve, timeline, and whether you need review, troubleshooting, planning, or implementation support.

Prefer direct email?

[email protected]

Name

Organization

Type of engagement

Environment and problem

Please do not include passwords, private keys, regulated data, proprietary configs, or sensitive system details in the initial inquiry. A secure exchange method can be arranged if needed.

Prefer email? Contact [email protected].