We are living in an unprecedented time of innovation, ingenuity, communication and collaboration. Businesses are more agile now than they’ve ever been. We’re more digitally engaged. We’ve moved many of our systems into the cloud and have more computing power, durable storage and scalability at our fingertips than ever before.
While very few of us had weathering a global pandemic in mind when we made the decision to go digital, that decision has undoubtedly proven valuable in this uncertain time.
Curbside pickup, online ordering, delivery, in-store pickup, telemedicine, distance learning, and myriad other systems, once luxurious and innovative, have become cornerstones of our lifestyle. Not only have they allowed many businesses to adapt quickly to the changes the Coronavirus has forced on us, but they have also allowed millions of people to comply with guidelines on social distancing, lockdowns, and the other protective measures we need to take to help fight this pandemic.
It is true that the shutdowns, lockdowns, and quarantines have had significant impacts on many people and businesses. But compared to what the impact would have been even 20 years ago, many people’s lifestyles have been affected relatively little. Many of us are already accustomed to shopping online, waiting for deliveries, buying online and picking up in store, waiting for curbside pickup, or using grocery delivery services. Without these key innovations, we would be feeling the effects of the social distancing measures much more acutely.
One of the wonderful things that can come out of trying times like these is the sense of shared responsibility, humanity, and global community. We are all trying our best to help -- both at an individual and at a business level. Perfume makers, breweries, and distilleries are making small tweaks to their manufacturing processes to produce hand sanitizer; some are even giving it away for free. Companies are offering free online learning courses to keep people engaged and educated while working from home. Others, like Microsoft and Adobe, are offering free software to help businesses run remotely. Amazon is giving people raises and opening 100,000 new jobs. Uber Eats and DoorDash are waiving commission fees.
As Coronavirus continues, a sense of normalcy will become increasingly important. We shouldn’t underestimate the impact of digital engagement in helping maintain contact with those outside our homes. In a time when everyone is looking for a way to help, one of the most impactful things we can do is to enable people to continue to live life as normally as possible and digital engagement can help with that. We can use this time to prioritize activities that will help in the short term, and continue to provide value after we’re through this.
Recommendations in this post will be geared towards those who use AWS as a cloud environment.
In general, systems should be sized to meet their load requirements, and proactive measures should be taken when an increase in load is anticipated, as we expect with the Coronavirus. The ability to proactively and easily increase capacity is scalability. Elasticity, on the other hand, is a tactical mechanism for handling unanticipated, temporary spikes in traffic. Your systems may have already scaled up elastically in response to the increased load caused by the Coronavirus. Elastic systems scale up and down to meet demand, but because of boot times on application servers, replication times on database servers, and other factors, a scaled-up elastic system is not an ideal way to operate at a higher load level for a sustained period. Usage levels should be investigated and baseline requirements adjusted to meet the new normal. Usage monitoring is an important part of any system: it lets your cloud environments run as lean as possible while maintaining performance.
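As a rough illustration of adjusting a baseline to the new normal, here is a minimal Python sketch that derives a suggested baseline from recent utilization samples. The sample data and the 20% headroom figure are hypothetical, not AWS guidance; in practice you would feed in real CloudWatch datapoints.

```python
# Sketch: derive a new capacity baseline from recent utilization samples.
# The headroom figure and sample data are hypothetical illustrations.

def new_baseline(utilization_samples, headroom=0.20):
    """Return a suggested baseline: the 95th-percentile load plus headroom."""
    samples = sorted(utilization_samples)
    p95 = samples[int(0.95 * (len(samples) - 1))]
    return p95 * (1 + headroom)

# e.g. hourly average CPU (%) over a recent window of elevated traffic
recent_cpu = [38, 41, 45, 52, 55, 58, 61, 60, 57, 50, 44, 40]
print(round(new_baseline(recent_cpu), 1))  # → 72.0
```

Sizing to a high percentile rather than the peak avoids permanently paying for your single worst hour, while the headroom leaves room for elasticity to absorb anything above it.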
System preparation is a multi-phase process. In this post I will outline some basic first-pass activities that can be performed to help stabilize your systems.
You may have already seen some upticks in your EC2 usage, some of which may have been absorbed by Auto Scaling groups. Using a tool like CloudHealth, AWS Compute Optimizer, or even just CloudWatch can help you understand what your new normal is. The recommendations from these tools are not always 100% accurate, but they give a great starting point. From the list of recommendations, you can raise the baseline capacity of your Auto Scaling groups or tune the instance sizing of your EC2 instances to meet the new level of demand. While reserved instances could be purchased after this phase, I would recommend doing some more advanced analysis before committing to purchasing reservations.
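As a simple illustration of acting on this kind of analysis, here is a hedged Python sketch that classifies instances for resizing from CloudWatch-style average CPU figures. The thresholds (70% up, 20% down) and the fleet numbers are hypothetical starting points, not output from any AWS tool.

```python
# Sketch: classify instances for resizing from average CPU utilization.
# Thresholds are illustrative rules of thumb, not AWS recommendations.

def rightsizing_action(avg_cpu_percent, scale_up_at=70.0, scale_down_at=20.0):
    if avg_cpu_percent >= scale_up_at:
        return "upsize"    # running hot at the new normal
    if avg_cpu_percent <= scale_down_at:
        return "downsize"  # paying for headroom you aren't using
    return "keep"

fleet = {"web-1": 82.5, "web-2": 64.0, "worker-1": 12.3}  # hypothetical
for name, cpu in fleet.items():
    print(name, rightsizing_action(cpu))
```

Note that a tool like Compute Optimizer also weighs memory, network, and burst behavior, so treat a CPU-only pass like this as a first-cut triage, not a final answer.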
Lambda and API Gateway take a lot of the pain out of scaling to meet demand. However, there are still some areas you may need to consider.
There is a concept called a “cold start” associated with Lambda functions. A cold start happens when Lambda has to initialize a new execution environment for your function because no warm one is available -- for example, after the function has been idle for a while (idle environments are typically reclaimed after several minutes, though AWS does not guarantee an exact duration). Cold starts add latency, which results in poor performance. Consider configuring Provisioned Concurrency for latency-sensitive functions. Use CloudWatch to determine the concurrent executions for each function, and use that to inform what your provisioned and reserved concurrency should be. You can also check whether you are being affected by cold starts by looking for a gap between API Gateway response time and Lambda execution time. If there is a significant gap (e.g., an API Gateway response time of 5 seconds against a Lambda execution time of 2 seconds), you may be experiencing cold starts.
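The gap check described above can be sketched in a few lines of Python. The 1-second gap threshold is a hypothetical cutoff for illustration; pick a value that matches your function's known initialization cost.

```python
# Sketch: flag likely cold starts by comparing API Gateway latency to the
# Lambda function's own execution time. The gap threshold is hypothetical.

def likely_cold_start(gateway_latency_ms, lambda_duration_ms, gap_threshold_ms=1000):
    """A large gap between the gateway's response time and the function's
    execution time suggests initialization overhead (a cold start)."""
    return (gateway_latency_ms - lambda_duration_ms) > gap_threshold_ms

print(likely_cold_start(5000, 2000))  # the example from the text: 3s gap → True
print(likely_cold_start(2100, 2000))  # warm invocation, 100ms gap → False
```

In practice you would run this over the paired latency metrics from CloudWatch rather than single datapoints, and look at how often the gap appears rather than whether it ever does.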
Another thing to consider, if your Lambda functions run in a VPC, is the number of Elastic Network Interfaces (ENIs) available to them. With a very high-throughput application, you may run out of ENIs and need to put in an AWS service limit increase request (note that Lambda’s improved VPC networking, rolled out in late 2019, shares ENIs across functions and makes this much less likely). In addition to ENIs, also consider the IP ranges of the subnets your functions run in. Ensure there is plenty of room for them to grow, and associate additional subnets with your functions if you find yourself coming up short.
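A quick way to gauge subnet headroom is to compare the subnet's usable address count against the IPs already in use. AWS reserves five addresses in every VPC subnet (the first four and the last). The CIDR and in-use count below are hypothetical; substitute your own subnet details.

```python
# Sketch: estimate remaining IP addresses in a Lambda subnet.
# AWS reserves 5 addresses per VPC subnet; the inputs here are hypothetical.
import ipaddress

AWS_RESERVED_PER_SUBNET = 5

def remaining_ips(cidr, ips_in_use):
    usable = ipaddress.ip_network(cidr).num_addresses - AWS_RESERVED_PER_SUBNET
    return usable - ips_in_use

print(remaining_ips("10.0.1.0/24", ips_in_use=200))  # → 51
```

If a check like this shows you running low, that is the cue to add subnets or re-CIDR before scaling pushes you into allocation failures.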
When looking at your RDS or Aurora instances, first check whether you are running a single instance or a cluster. While we won’t go over the details of upgrading to a cluster in this post, if you are not already clustering your databases, strongly consider doing so; Aurora is clustered out of the box, and it removes much of the risk and operational overhead of running a cluster yourself. If you are running a single instance, your options are more limited. First assess whether you need to upgrade by checking the detailed CloudWatch metrics for your database. If you’ve seen a spike in usage and its peaks are getting close to the 70% range, the instance is a good candidate for a size upgrade. Tools like CloudHealth can also help by giving you sizing recommendations.
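The spike-plus-threshold judgment above can be made explicit. This sketch combines the two signals; the 1.5x spike multiplier and the "within 90% of 70%" notion of "close" are hypothetical knobs, not AWS guidance.

```python
# Sketch: flag a single RDS instance as an upgrade candidate when usage has
# spiked AND peaks are approaching 70%. Multipliers are hypothetical.

def upgrade_candidate(baseline_cpu, recent_peak_cpu, near=0.9, threshold=70.0):
    spiked = recent_peak_cpu > baseline_cpu * 1.5     # usage jumped noticeably
    near_limit = recent_peak_cpu >= threshold * near  # peaks approaching 70%
    return spiked and near_limit

print(upgrade_candidate(baseline_cpu=35.0, recent_peak_cpu=66.0))  # → True
print(upgrade_candidate(baseline_cpu=35.0, recent_peak_cpu=40.0))  # → False
```

Requiring both conditions avoids upgrading for a database that has always run warm but stable, or one that spiked briefly while staying far from its limits.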
If you’re already using a cluster, you’re in good shape. Check the usage on your reader and writer instances. If your reader instances are getting loaded down, consider adding autoscaling if you don’t have it already, and either upgrade the reader instance sizes or add reader instances; weigh the cost difference between horizontal and vertical scaling. If you are running a single-writer setup and your writer instance is getting bogged down, consider upgrading it. This can be done with minimal disruption by adding a larger instance to the cluster and promoting it to writer.
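Comparing horizontal and vertical scaling costs is simple arithmetic once you have per-hour rates. The instance classes and prices below are placeholders, not real AWS pricing; substitute current rates for your region and engine.

```python
# Sketch: compare horizontal vs. vertical reader scaling costs.
# Prices are hypothetical placeholders, NOT actual AWS rates.

PRICE_PER_HOUR = {"db.r5.large": 0.29, "db.r5.xlarge": 0.58}

def monthly_cost(instance_class, count, hours=730):
    return PRICE_PER_HOUR[instance_class] * count * hours

horizontal = monthly_cost("db.r5.large", count=3)  # three smaller readers
vertical = monthly_cost("db.r5.xlarge", count=1)   # one larger reader
print(round(horizontal, 2), round(vertical, 2))
```

Cost is only one input: multiple smaller readers also buy availability and failover targets, while one larger reader simplifies operations and avoids replica-lag fan-out.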
There are a few more advanced topics here that we can talk about in later posts around proper usage of readers and writers in your application code, upgrading to Aurora, and high availability / replication and disaster recovery.
DynamoDB can operate in two capacity modes: on-demand and provisioned. I’ll write with the assumption that you’re using provisioned capacity today. The first step is to assess where you are with your provisioned capacity. If you don’t already have a CloudWatch dashboard comparing provisioned versus consumed capacity for your DynamoDB tables, now would be a good time to create one. If you’re getting within 70% of your provisioned capacity, consider increasing it. This change is non-disruptive and will not cause downtime while the table scales up. You may also want to consider enabling read and write autoscaling. While DynamoDB autoscaling has the same challenges as EC2 Auto Scaling (slower scaling time, decreased performance for “overflow” requests while scaling), it will ultimately let you absorb a short-term spike in demand. For a more refined approach, set an alarm on DynamoDB autoscaling events: if your tables start autoscaling, that should be a cue for the operations team to revisit the table’s provisioned capacity.
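The 70% rule of thumb above is easy to automate once you have consumed and provisioned capacity numbers from CloudWatch. The table names and capacity units below are hypothetical.

```python
# Sketch: check DynamoDB table utilization against the 70% rule of thumb
# from the text. Table names and capacity numbers are hypothetical.

def needs_more_capacity(consumed, provisioned, threshold=0.70):
    return consumed / provisioned >= threshold

tables = {
    "orders":   {"consumed_rcu": 850, "provisioned_rcu": 1000},
    "sessions": {"consumed_rcu": 300, "provisioned_rcu": 1000},
}
for name, caps in tables.items():
    print(name, needs_more_capacity(caps["consumed_rcu"], caps["provisioned_rcu"]))
```

Run the same check separately for read and write capacity: the two are provisioned independently, and a table can be comfortable on one while starving on the other.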
Most AWS environments will have at least one of the following: EC2, RDS or Aurora, Lambda and API Gateway, or DynamoDB. We’ve covered some basic ways you can prepare systems built on these technologies to scale under load. In future posts, I will do deeper dives into more refined and advanced ways you can prepare your systems.
We understand that you have a lot on your plate especially with the uncertainty right now. If you need expert consultation on preparing your environments to meet increased demand, Mobiquity is here to help. Contact us for a consultation.
Give us your information below to start the conversation.