

Rebuilding Infrastructure for a Multi-National Snow Trip Specialist
Industry
Entertainment & Leisure
Company size
10 - 50 Employees
About
OzSnow is an Australian-owned and operated tour operator and resort owner that specializes in affordable, all-inclusive snow holiday packages for skiing and snowboarding. They cater heavily to beginners, students, groups, and families by bundling transportation, accommodation, lift passes, and gear hire in one place.
"Halcrow took over an environment that was fragile and opaque, and made it resilient fast. We went from peak-season outages and failed checkouts to a full season of 100% uptime, with clear monitoring and alerts so we were never surprised."
100% uptime through ski season
Previously the platform crashed frequently during traffic surges; OzSnow now has 100% uptime through the full ski season.
Isolated environments with auto-scaling
The previous infrastructure model was 18 containers on 1 EC2 instance; Halcrow has since implemented isolated environments with auto-scaling.
We've been inside enough initiatives to know where the value actually is and where businesses waste on technology.
The Situation
OzSnow is a one-stop-shop travel operator for snow trips: they book the packages, run the buses, own the hotels, and manage the hospitality. Their customers travel to resorts across Australia and Japan. The business runs on a model with brutal seasonal concentration — almost all revenue flows through a short, intense winter peak when customers are researching, booking, and transacting.
When peak season arrived, the platform couldn't hold.
Customers browsing packages or attempting to book were met with crashed pages. Transactions failed. During major promotional periods — the moments of highest marketing spend and highest purchase intent — the entire platform became unavailable. For an e-commerce business with a narrow seasonal window, each outage was direct revenue loss.
What we found when we took over the environment
OzSnow was running 18 separate websites serving customers globally. All 18 ran as Docker containers on a single AWS EC2 instance. No dedicated DevOps management. No architectural separation between sites. No monitoring that showed what was happening until after customers were already experiencing failures.
Problem 1: No capacity to absorb traffic spikes
When peak season arrived and promotional campaigns drove traffic, the single server's CPU and memory were exhausted. The system had no way to scale. More traffic meant slower pages, then errors, then complete unavailability.
Problem 2: No isolation between applications
If one site experienced a surge or a code issue, it could destabilise the entire server and take all 18 sites offline simultaneously. A problem on one country's booking site could kill Australia's checkout. There was no blast radius containment. During peak sales periods, that's exactly what happened. Repeatedly.
Key Result Metric (KRM): 100% uptime during peak ski season. Zero revenue loss from infrastructure failure.
Why they called us
OzSnow's leadership understood they had an infrastructure problem. What they hadn't diagnosed was whether the fix was simply "a bigger server" or something more structural.
The previous configuration had been built by a development team focused on getting sites live, not on operating them at scale during peak demand. Nobody had been responsible for infrastructure resilience. The problem had grown until it became critical — then it became Halcrow's to solve.
Law 9: The constraint is usually organisational, not technical. The technical fix wasn't complex. The structural failure was the absence of anyone with accountability for how the infrastructure performed under pressure.
How we worked
Embedded System: Infrastructure Ownership from Day One
We took complete ownership of the infrastructure environment. This was not consulting at arm's length: we had direct access to the AWS account, the EC2 instances, the container configurations, and the monitoring layer.
The first step was visibility.
Before changing anything, we needed to understand exactly where the pressure was being applied. We introduced real-time performance monitoring across all 18 sites — CPU, memory, request latency, error rates. Not just checking whether sites were up, but understanding the pattern of how they degraded under load.
This monitoring revealed the traffic distribution problem: during peak periods, two or three sites were responsible for over 80% of the load, but all 18 shared the same resources. The high-traffic sites were starving the low-traffic sites, which were in turn destabilising the high-traffic sites.
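A concentration like the one described above can be checked from monitoring data with a short calculation. The sketch below is illustrative only: the site names and request counts are hypothetical, not OzSnow's figures, and `top_n_share` is a helper name introduced here.

```python
def top_n_share(requests_by_site: dict[str, int], n: int) -> float:
    """Return the fraction of total requests handled by the n busiest sites."""
    total = sum(requests_by_site.values())
    if total == 0:
        return 0.0
    busiest = sorted(requests_by_site.values(), reverse=True)[:n]
    return sum(busiest) / total


# Hypothetical peak-hour request counts (illustrative, not real client data).
sample = {"site-au": 5200, "site-jp": 2600, "site-nz": 900}
sample.update({f"site-{i:02d}": 80 for i in range(15)})  # 15 quiet sites

share = top_n_share(sample, 3)
print(f"Top 3 of {len(sample)} sites carry {share:.0%} of requests")
```

Running the same calculation per hour, rather than per day, is what makes the peak-period skew visible: a daily average can hide a surge that lasts only a few hours.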
Law 1: Distance is the enemy of speed. Without visibility, the infrastructure team was always reacting to outages after the fact. Monitoring moved the response from reactive to anticipatory.
Recovery Phase
Phase 1 Architecture Redesign (Weeks 1-3): Eliminate the single point of failure
The core architectural change: move from 18 containers on one instance to isolated environments per application.
Each site deployed within its own isolated container environment. Traffic surge on one site cannot destabilise others. A code issue on one country's booking flow doesn't take Australian checkout offline.
This is not a complex architectural principle — isolated environments are standard infrastructure practice. The previous architecture simply hadn't been designed by anyone thinking about failure modes under load.
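The difference in blast radius can be shown with a toy model. This is a deliberately simplified sketch, not the deployed configuration: the load units, capacity figures, and site names are hypothetical, and it ignores auto-scaling, which would add capacity for the surging site.

```python
def degraded_on_shared_host(demand: dict[str, float], host_capacity: float) -> set[str]:
    """Shared host: once total demand exceeds capacity, every site degrades."""
    return set(demand) if sum(demand.values()) > host_capacity else set()


def degraded_when_isolated(demand: dict[str, float], limits: dict[str, float]) -> set[str]:
    """Isolated environments: only a site that exceeds its own limit degrades."""
    return {site for site, load in demand.items() if load > limits[site]}


# Hypothetical load units: one site surges while the others stay quiet.
demand = {"booking-au": 2.0, "booking-jp": 9.0, "blog": 0.5}
limits = {"booking-au": 4.0, "booking-jp": 4.0, "blog": 1.0}

print(sorted(degraded_on_shared_host(demand, host_capacity=8.0)))  # all three sites
print(sorted(degraded_when_isolated(demand, limits)))              # only the surging site
```

In the shared model the surge on one site takes the quiet sites down with it; in the isolated model the damage stays with the site that caused it.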
Phase 2 Auto-Scaling Implementation (Weeks 3-5): Let the platform expand and contract automatically with demand.
Auto-scaling rules configured: when CPU or memory crosses defined thresholds on any site's environment, additional compute spins up automatically. When demand falls, it scales back down.
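The threshold logic described above can be sketched as a small decision function. This is a minimal illustration under stated assumptions: the threshold values and instance limits are hypothetical, `ScalingRule` and `desired_instances` are names introduced here, and on AWS this behaviour would normally be expressed as auto-scaling policy configuration rather than application code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScalingRule:
    scale_out_above: float  # utilisation (0-1) above which capacity is added
    scale_in_below: float   # utilisation (0-1) below which capacity is removed
    min_instances: int
    max_instances: int


def desired_instances(current: int, cpu: float, memory: float, rule: ScalingRule) -> int:
    """Scale out if CPU or memory is high; scale in only if both are low."""
    if cpu > rule.scale_out_above or memory > rule.scale_out_above:
        target = current + 1
    elif cpu < rule.scale_in_below and memory < rule.scale_in_below:
        target = current - 1
    else:
        target = current
    # Never go below the floor or above the ceiling for this site.
    return max(rule.min_instances, min(rule.max_instances, target))


# Hypothetical rule for one site's environment.
rule = ScalingRule(scale_out_above=0.70, scale_in_below=0.30, min_instances=1, max_instances=4)
print(desired_instances(current=1, cpu=0.90, memory=0.40, rule=rule))  # surge: add capacity
print(desired_instances(current=2, cpu=0.10, memory=0.15, rule=rule))  # quiet: remove capacity
```

Scaling out on either metric but scaling in only when both are low is a conservative choice: it avoids removing capacity while one resource is still under pressure.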
The business implication: peak season promotional bursts (the moments when OzSnow ran its most important marketing campaigns) would no longer be the moments of highest failure risk. The platform would scale to meet the traffic.
Cost implication: Auto-scaling with proper rightsizing meant OzSnow was no longer paying for peak-capacity compute year-round. Resources expand when needed, contract when not. Monthly AWS spend reduced approximately 40% compared to the over-provisioned single-instance model.
Phase 3 Traffic Routing Simplification (Weeks 5-7): Reduce infrastructure management overhead
The 18-site architecture had accumulated multiple entry points, separate load balancers, and routing complexity that was difficult to maintain. We redesigned so all incoming traffic was handled through a single unified gateway — one load balancer, one routing layer, 18 applications behind it.
Simpler to manage. Fewer failure points. Faster diagnostic path when something goes wrong.
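One common way to put many applications behind a single gateway is host-based routing: the gateway reads the request's hostname and forwards to the matching backend. The sketch below assumes that approach for illustration; the hostnames and backend addresses are hypothetical placeholders, and a managed load balancer would hold these rules as configuration rather than code.

```python
from typing import Optional

# Hypothetical routing table: public hostname -> internal backend address.
ROUTES = {
    "au.example.com": "http://au-app.internal:8080",
    "jp.example.com": "http://jp-app.internal:8080",
}


def resolve_backend(host_header: str, routes: dict[str, str]) -> Optional[str]:
    """Map a Host header (optionally with a port) to a backend, or None if unknown."""
    # Drop any ":port" suffix and normalise case before the lookup.
    # IPv6 literal hosts are not handled in this simplified sketch.
    host = host_header.split(":", 1)[0].strip().lower()
    return routes.get(host)


print(resolve_backend("AU.example.com:443", ROUTES))
print(resolve_backend("unknown.example.com", ROUTES))
```

With one table to inspect, diagnosing a misrouted request means checking a single place instead of several separate entry points.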
Phase 4 Monitoring and Alerting (Ongoing): Never be surprised by a failure again.
Alert thresholds configured for each site individually. Notification path defined: infrastructure anomaly → Halcrow alerted immediately → OzSnow notified within 15 minutes of anything that affects customer experience.
The monitoring layer we built became the operational foundation for ongoing maintenance. Problems caught at threshold, not at crash.
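Per-site thresholds and the 15-minute notification window can be sketched as follows. The metric names and threshold values are hypothetical, and `SiteThresholds`, `breached_metrics`, and `client_notify_deadline` are names introduced for illustration; only the 15-minute window comes from the notification path above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

CLIENT_NOTIFY_WINDOW = timedelta(minutes=15)  # from the notification path above


@dataclass(frozen=True)
class SiteThresholds:
    max_cpu: float             # utilisation, 0-1
    max_error_rate: float      # fraction of requests failing, 0-1
    max_p95_latency_ms: float


def breached_metrics(sample: dict[str, float], limits: SiteThresholds) -> list[str]:
    """Return the names of metrics in this sample that exceed the site's limits."""
    checks = {
        "cpu": limits.max_cpu,
        "error_rate": limits.max_error_rate,
        "p95_latency_ms": limits.max_p95_latency_ms,
    }
    # A metric missing from the sample is treated as 0.0, i.e. not breached.
    return [name for name, limit in checks.items() if sample.get(name, 0.0) > limit]


def client_notify_deadline(detected_at: datetime) -> datetime:
    """Latest time by which the client should be told about a customer-facing issue."""
    return detected_at + CLIENT_NOTIFY_WINDOW


# Hypothetical thresholds and sample for one site.
limits = SiteThresholds(max_cpu=0.80, max_error_rate=0.02, max_p95_latency_ms=1500.0)
sample = {"cpu": 0.91, "error_rate": 0.01, "p95_latency_ms": 2100.0}
print(breached_metrics(sample, limits))
print(client_notify_deadline(datetime(2025, 7, 1, 9, 0)))
```

Keeping thresholds per site matters because a busy booking site and a quiet brochure site have very different normal ranges; one shared threshold would either miss problems or alert constantly.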
The Outcome
Metric: Before / After
Infrastructure model: 18 containers on 1 EC2 instance / Isolated environments with auto-scaling
Monitoring: None / Real-time across all 18 sites
Peak season uptime: Frequent crashes during surges / 100% uptime through the full ski season
Traffic spike response: Manual (or none) / Automatic scaling
Blast radius: Single-site failure → all 18 affected / Each site isolated
Monthly AWS cost: Over-provisioned baseline / ~40% reduction via auto-scaling
Incident response: Reactive (customer reports) / Proactive (threshold alerts)
The result that mattered most: OzSnow went through a full peak ski season — their entire annual revenue window — with zero infrastructure outages during major traffic periods. The platform was stable when it needed to be stable.
Why this worked
The Single Point of Failure Pattern
This infrastructure pattern repeats across small-to-medium businesses that have grown their digital footprint faster than their infrastructure design:
Developer deploys product. Works fine at low traffic. Business grows. Developer adds more products to the same server. Still works. Business runs first major campaign.
Traffic spikes. Everything crashes.
The fix is never "more RAM." More RAM on a single instance delays the crash — it doesn't prevent it. The fix is architectural: isolation, auto-scaling, proper blast radius containment.
Law 8: Activity is not progress. Previous attempts to address OzSnow's stability had involved increasing server size — activity without addressing the structural cause. The crashes continued because the architecture hadn't changed.
The Monitoring Prerequisite
You cannot fix what you cannot see. The first week of this engagement was spent on monitoring because without it, we were operating blind — guessing where the problem was rather than observing it.
Every infrastructure recovery project should start with visibility. Not with immediate changes to the architecture. You need to understand the failure mode before you change the thing that's failing.
The Pattern This Reveals
Seasonal businesses are infrastructure stress tests. Most platforms aren't designed for the stress.
If your business has a revenue concentration peak — ski season, Christmas, tax time, any defined period where traffic and transaction volume spike — your infrastructure needs to be designed for that peak, not for your average day.
Auto-scaling exists precisely for this scenario. It's not expensive. It's not technically complex. The businesses that don't have it simply haven't had anyone responsible for thinking about what happens when the promotional campaign works.
What you're buying
If your platform crashes when you need it most, we fix the structural causes, not the symptoms: isolation, auto-scaling, traffic routing simplification.
After month one, you know your infrastructure will hold during peak. Ready to stop losing revenue during your most important sales periods? Book a 20-minute call with Sam Halcrow on 0431197004 or sam@halcrow.com.au.
—
Case study written May 2026. OzSnow is a real client. All data sourced from infrastructure performance monitoring, AWS billing records, and peak season uptime logs.