Storage
Lustre consulting for performance and reliability
Performance triage, workload and I/O-pattern review, storage health assessment, failure trend review, and practical recommendations for production parallel I/O environments.
Independent HPC and Linux infrastructure consulting
Blue Ridge Compute provides independent HPC and Linux infrastructure consulting for research computing teams, data-intensive organizations, Slurm schedulers, Lustre/ZFS storage, backup/recovery readiness, performance triage, procurement review, and operations planning.
Focused senior technical review for universities, labs, small technical organizations, and data-intensive teams that need field-credible infrastructure judgment without a permanent senior hire.
About
Vendors know their products. Internal teams know their users. Blue Ridge Compute helps bridge the gap with independent technical review from someone who has operated production research infrastructure and lived with real failure modes.
Blue Ridge Compute is an independent HPC and Linux infrastructure consultancy focused on research computing, large shared storage, scheduler operations, backup/recovery readiness, monitoring, and operational documentation.
The work is grounded in years of production experience operating HPC and multi-petabyte storage infrastructure for a major scientific research environment. That background shapes every engagement: practical, risk-aware, and focused on what your team can actually maintain.
Whether you are planning a storage refresh, troubleshooting performance complaints, reviewing backup and recovery practices, migrating platforms, or documenting an inherited cluster, the goal is the same: help you understand what is fragile, what is risky, and what to fix first.
Services
Blue Ridge Compute is not a general IT shop. Engagements are designed around HPC systems, Linux infrastructure, large-scale storage, backup/recovery readiness, research computing operations, and the planning decisions that shape them.
Common engagements include HPC consulting, Slurm configuration review, Lustre performance troubleshooting, ZFS storage reliability review, Linux infrastructure health checks, backup and recovery readiness review, Linux cluster operations assessment, HPC procurement review, and research computing infrastructure documentation.
Storage
Performance triage, workload and I/O-pattern review, storage health assessment, failure trend review, and practical recommendations for production parallel I/O environments.
Compute
Scheduler policy review, Slurm configuration guidance, RHEL lifecycle planning, compute-node operations review, and implementation support for clearly scoped projects.
Architecture
Independent assessment of proposed architectures, vendor quotes, topology decisions, network assumptions, capacity planning, support assumptions, and long-term operational burden.
Migration
Planning and risk review for scheduler migrations, RHEL 7 to 8/9 transitions, authentication modernization, storage moves, and research workflow continuity.
Operations
Alert review, operational checklists, recurring task automation, handoff documentation, and procedures that reduce single-person knowledge risk.
Assessment
A structured review of compute and storage environments with severity-ranked findings, quick wins, medium-term risks, and practical remediation paths.
Linux infrastructure
Fixed-scope review of Linux servers, patching practices, monitoring coverage, access patterns, operational documentation, and risks that small technical teams may not have time to assess internally.
Resilience
Review of backup coverage, restore assumptions, retention practices, offsite copies, monitoring signals, and recovery procedures so teams know what would actually work after a failure.
Problems
Most teams do not need a permanent consultant. They need a grounded review of what is risky, what is fragile, and what deserves attention before the next outage, purchase, or migration.
The cluster works, but no one is sure it is configured correctly.
Users complain about performance, but the bottleneck is unclear.
Storage, backups, and recovery assumptions are growing faster than monitoring, testing, or replacement procedures.
Linux infrastructure is business-critical, but patching, access, monitoring, and recovery practices have grown organically.
You are reviewing a vendor proposal and want a second set of technical eyes.
Operational documentation exists mostly in one person’s head.
You inherited an HPC, Linux, or storage environment and need to understand what you actually have.
Engagement model
Engagements are built for senior technical review, planning, troubleshooting, and documentation. The default output is not an endless meeting cycle — it is a clear written assessment, decision memo, or remediation plan your team can use.
Discover
We define the system, problem, timeline, constraints, and what a useful outcome looks like.
Review
Work may include configuration review, diagrams, vendor quotes, monitoring screenshots, logs, procedures, or scheduled screen-share sessions.
Document
You receive written findings with risks, tradeoffs, remediation options, and next steps ranked by importance.
Support
Optional follow-up can include implementation guidance, additional review, or scheduled troubleshooting during agreed support windows.
Fit
Blue Ridge Compute is best suited for teams with real HPC, Linux infrastructure, storage, scheduler, backup/recovery, or research computing decisions to make — especially when an independent senior technical review would reduce risk.
Most work is remote-first and scheduled. Limited hands-on implementation may be available for clearly scoped projects. Emergency response, production access, recurring support, and after-hours availability require a separate agreement with explicit expectations.
Packages
Starting prices are intended for bounded remote reviews with clearly defined inputs and deliverables. Final pricing depends on environment size, access requirements, number of systems reviewed, urgency, meetings, vendor involvement, and implementation scope.
Assessment
Starting at $3,500
Structured review of Linux HPC, Slurm, shared compute, monitoring, documentation, and operational risk.
Infrastructure
Starting at $3,500
Fixed-scope review of Linux servers, monitoring, patching, access patterns, documentation, recurring operations, and operational risk for small technical organizations.
Resilience
Starting at $4,000
Review of backup coverage, retention, offsite copies, restore testing, monitoring, documentation, and recovery assumptions before a failed restore becomes the first real test.
Procurement
Starting at $5,000
Second-opinion review of proposed compute, storage, network, support, and scheduler architecture before purchase or implementation. Multi-vendor comparisons or RFP support may require expanded scope.
Storage
Starting at $4,500
Lustre, ZFS, and Linux storage review focused on reliability, drive health, monitoring, backup assumptions, replacement workflows, and operational gaps.
Performance
Starting at $5,000 limited triage
Review of observed performance issues, workload patterns, available metrics, configuration, striping practices, and test results. Expanded benchmarking or production tuning requires separate scope.
Compute
Starting at $3,500
Review of Slurm policy, partitions, limits, priorities, node state patterns, cgroups, job failure signals, and operational practices for Linux HPC clusters.
Stability
Starting at $5,000
Non-emergency review of outages, recurring instability, unusual failures, or unclear recovery events after service has been restored.
Operations
Starting at $2,500
Operational checklist creation, alert triage review, recurring task documentation, and handoff procedures, usually paired with an assessment, incident review, or reliability engagement.
Advisory
Starting at $2,500 / month
Recurring senior technical review for teams that need periodic architecture feedback, second opinions, or scheduled troubleshooting. Base retainers include a defined monthly hour limit and scheduled support only; emergency response is not included unless separately contracted.
Who we serve
The best engagements are with teams that already understand the value of reliable compute, storage, backups, and operations, but need targeted senior review, planning, or documentation help.
Research universities
Senior HPC perspective for research computing teams that need focused expertise without adding a permanent hire.
Public research
Support for public-sector and federally funded research teams with scoped HPC infrastructure, storage, or documentation needs.
Enterprise HPC
Infrastructure guidance for organizations scaling Linux compute, shared storage, scheduler policy, and data-intensive workflows, including environments preparing for GPU or AI workloads.
Small technical teams
Fixed-scope assessment for SaaS, engineering, and data-intensive teams that need senior review of Linux operations, backups, monitoring, and recovery readiness.
Inherited systems
Assessment and documentation for teams that have inherited systems, lost key staff, or need to reduce single-person knowledge risk.
Growing teams
Review for teams expanding compute, storage, monitoring, or scheduler policy before small operational issues become structural problems.
Independent by design
Blue Ridge Compute provides independent technical review. Recommendations are based on operational fit, maintainability, risk, and your team’s constraints — not hardware resale incentives.
Professional boundaries
Engagements are accepted only where they do not conflict with existing employment, confidentiality obligations, or prior commitments. Production access, sensitive data, emergency support, and recurring availability must be explicitly scoped in advance.
Contact
Useful first messages include the approximate size of the environment, current stack, the problem you are trying to solve, timeline, and whether you need review, troubleshooting, planning, or implementation support.
Prefer direct email?
[email protected]