Category: Programming

Machine Learning – Basic Implementation (Part I – Linear Regression)

Abstract

This post is the first in a series on implementing specific machine learning algorithms to solve real problems. It walks you through the implementation step by step and shows how to tune each part of the process in order to optimise the model.

Prerequisites

You should have an overview of the mathematics behind linear regression and be familiar with terms such as features, gradient descent, cost function, etc.
We also assume that all data has been collected and cleaned before any implementation starts.

Deployment

1. Normalize data
Because the features come in very different scales, we should normalize the data before doing any visualisation. This is easy to do with the Python numpy library:

import numpy as np

def normalize_features(df):
    """
    Normalize the features in the data set so that each column
    has zero mean and unit standard deviation.
    """
    mu = df.mean()
    sigma = df.std()
    if (sigma == 0).any():
        raise Exception("One or more features had the same value for all samples, and thus could "
                        "not be normalized. Please do not include features with only a single value "
                        "in your model.")
    # Use the column means and standard deviations computed above
    df_normalized = (df - mu) / sigma

    return df_normalized, mu, sigma
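
As a quick sanity check, here is a minimal usage sketch (the small DataFrame below is hypothetical; any numeric pandas DataFrame works):

import pandas as pd

# Hypothetical example data
df = pd.DataFrame({'Hour': [0, 6, 12, 18], 'meantempi': [55, 60, 70, 65]})

df_normalized, mu, sigma = normalize_features(df)
print(df_normalized)   # each column now has mean ~0 and std ~1
print(mu, sigma)       # keep these to normalize future data the same way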

2. Visualize data
Of course, the first step before choosing any model to train is to see how the data (x, y) interact on a graph (a minimal plotting sketch follows the criteria list). Linear regression is likely the right algorithm if your data (or the graph of your data) satisfies all of the criteria below:

  • The points should scatter around the best-fit line with roughly the same standard deviation all along the line. If many points lie too far (high or low) from the best-fit line, linear regression is probably not appropriate.
  • The measurement of X (the features) should be essentially exact. Any imprecision in measuring X should be very small compared to the variability of Y.
  • The data points should be independent of each other: a change in one observation should not change the others.
  • There should be a genuine correlation between X and Y, not a definitional one. For example, with midterm score vs. total score, the midterm score is a component used to calculate the total score, so linear regression is not valid for that data.
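
As a quick illustration, here is a minimal plotting sketch (assuming matplotlib and a pandas DataFrame named dataframe with the columns used in the rest of this post):

import matplotlib.pyplot as plt

# Pick one feature and the target to inspect visually
x = dataframe['Hour']
y = dataframe['ENTRIESn_hourly']

plt.scatter(x, y, alpha=0.3)          # raw points
plt.xlabel('Hour')
plt.ylabel('ENTRIESn_hourly')
plt.title('Feature vs. target: look for a roughly linear trend')
plt.show()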

3. Gradient descent vs. statsmodels OLS
Before talking about tuning the model, let's get started with a basic step: using gradient descent and statsmodels OLS to find a first set of parameters theta. This set of theta might not be the best, but it gives an overview of the steps for tuning the model later.

All cleaned data can be downloaded from here: Turnstile Data of New York Subway
By gradient descent

import math
import numpy as np
import pandas

def compute_cost(features, values, theta):
    # Mean squared error cost: J(theta) = sum((X.theta - y)^2) / (2m)
    m = len(values)
    sum_of_square_errors = np.square(np.dot(features, theta) - values).sum()
    cost = sum_of_square_errors / (2 * m)
    return cost

def gradient_descent(features, values, theta, alpha, num_iterations):

    m = len(values)
    cost_history = []

    for i in range(num_iterations):
        predict_v = np.dot(features, theta)
        theta = theta - alpha / m * np.dot((predict_v - values), features)
        cost = compute_cost(features, values, theta)
        cost_history.append(cost)
    return theta, pandas.Series(cost_history)


def predictions(dataframe):


    # Select features (try different features!) - all features used to predict ENTRIESn_hourly
    features = dataframe[['Hour', 'maxpressurei','maxdewpti','mindewpti', 'minpressurei','meandewpti','meanpressurei', 'meanwindspdi','mintempi','meantempi', 'maxtempi','precipi']]
    dummy_units = pandas.get_dummies(dataframe['UNIT'], prefix='unit')
    features = features.join(dummy_units)

    # Values - or y in model
    values = dataframe['ENTRIESn_hourly']
    m = len(values)

    features, mu, sigma = normalize_features(features)
    # Add a column of 1s (for theta0)
    features['ones'] = np.ones(m)

    # Convert features and values to numpy arrays
    features_array = np.array(features)
    values_array = np.array(values)

    # Set values for alpha, number of iterations.
    # learning rate
    alpha = 0.1
    # number of gradient descent iterations to run
    num_iterations = 15000

    # Initialize theta, perform gradient descent
    theta_gradient_descent = np.zeros(len(features.columns))
    theta_gradient_descent, cost_history = gradient_descent(features_array,
                                                            values_array,
                                                            theta_gradient_descent,
                                                            alpha,
                                                            num_iterations)

    predictions = np.dot(features_array, theta_gradient_descent)
    # r = sqrt(R^2), where R^2 is the coefficient of determination
    r = math.sqrt(compute_r_squared(values_array, predictions))
    return predictions
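
The helper compute_r_squared is not defined in this post; a minimal sketch, assuming the standard definition of the coefficient of determination, could look like this:

def compute_r_squared(values, predictions):
    # R^2 = 1 - SS_res / SS_tot
    ss_res = ((values - predictions) ** 2).sum()
    ss_tot = ((values - np.mean(values)) ** 2).sum()
    return 1 - ss_res / ss_tot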

By statsmodels OLS

import pandas as pd
import statsmodels.api as sm

def predictions(df_in):
    # select the features to use
    #feature_names = ['meantempi', 'Hour']
    feature_names = ['Hour', 'maxpressurei','maxdewpti','mindewpti', 'minpressurei','meandewpti','meanpressurei', 'meanwindspdi','mintempi','meantempi', 'maxtempi','precipi']

    # initialize the Y values
    Y = df_in['ENTRIESn_hourly']

    # initialize the X features by adding dummy units, then standardize
    dummy_units = pd.get_dummies(df_in['UNIT'], prefix='unit')
    X = df_in[feature_names].join(dummy_units)
    X, mu, sigma = normalize_features(X)

    # adding a constant to the model improves it a little
    X = sm.add_constant(X)

    # ordinary least squares model
    model = sm.OLS(Y, X)

    # fit the model
    results = model.fit()
    prediction = results.predict(X)

    return prediction
Analysis

  • Both techniques require adding dummy variables to isolate the categorical features. This improves the model a lot.
  • Adding a constant in statsmodels (which mean-shifts Y by a constant value) is not meaningless, but it only improves the model a little.
  • Both approaches use the coefficient of determination (R²) to validate the model: the closer it is to 1, the better the model.

(to be continued)

Designing scalable systems – Part 1: The Basics

1. Introduction

Nowadays, web applications are becoming more and more popular. Large websites are serving billions of users every day, with minimal to zero percent down-time. Since you are reading this article, there’s a high chance you are already running a large system or are going to build one.

In this blog series, I’m going to share my experience on how to design web applications for scalability. This first article is not intended to go into too much detail, but instead to give you a rough idea of what should be considered when designing a scalable web application. I’m going to share the way I think when designing a scalable system, and not the solution to any specific system.

The ideas mentioned in this blog series apply not only to building websites, but also to building applications and software systems in general.

Let the fun begin! 😀
 

2. The Basics

What is scalability anyway?

From Wikipedia:

Scalability is the capability of a system, network, or process to handle a growing amount of work, or its potential to be enlarged to accommodate that growth.

Let’s look at an example. Our application may currently be running on an instance with 2 CPUs and 8 GB of memory, serving two million page views per day. What if that number of page views doubles by tomorrow, then grows ten times by next week, and then a thousand times by the end of next month? Is our application prepared for that? What are the plans to handle the extra workload? Are we going to upgrade our server to a larger one? Are we going to buy more servers? Or is there anything else we are going to do?

If our application is prepared for the growth of users or page views or transactions, our application is scalable.
The plans that we prepare for our application to grow, or in other words, scale, with the growth of users or transactions, are called the scaling strategies.
Designing such a plan so that our application can scale, is designing for scalability.

Scalability, simply, is about doing what you do in a bigger way. Scaling a web application is all about allowing more people to use your application. If you can’t figure out how to improve performance while scaling out, it’s okay. And as long as you can scale to handle larger number of users it’s ok to have multiple single points of failures as well. — Royans K Tharakan

Vertical Scalability vs. Horizontal Scalability

We have two choices when it comes to scaling: Vertical and Horizontal.

  • Vertical Scalability – Vertical scaling means scaling by adding more power (CPU, RAM) to your existing machine. It basically means upgrading the server. Examples would be adding CPUs to an existing server, or expanding storage by adding hard drives to an existing RAID/SAN storage.
  • Horizontal Scalability – Horizontal scaling means scaling by adding more machines to your resource pool. It is the ability to connect multiple instances so that they function as a single logical unit. Basically, it means increasing the number of servers. Most clustering solutions, distributed file systems, and load balancers help you with horizontal scalability.
Vertical Scaling vs. Horizontal Scaling

Royans Tharakan wrote about this on his blog:

If you need scalability, urgently, going vertical is probably going to be the easiest (provided you have the bank balance to go with it). In most cases, without a line of code change, you might be able to drop in your application on a super-expensive 64 CPU server from Sun or HP and storage from EMC, Hitachi or Netapp and everything will be fine. For a while at least. Unfortunately Vertical scaling, gets more and more expensive as you grow.

Horizontal scalability, on the other hand doesn’t require you to buy more and more expensive servers. It’s meant to be scaled using commodity storage and server solutions. But Horizontal scalability isn’t cheap either. The application has to be built ground up to run on multiple servers as a single application. Two interesting problems which most application in a horizontally scalable world have to worry about are “Split brain” and “hardware failure“.

While infinite horizontal linear scalability is difficult to achieve, infinite vertical scalability is impossible. If you are building capacity for a pre-determined number of users, it might be wise to investigate vertical scalability. But if you are building a web application which could be used by millions, going vertical could be an expensive mistake.

But scalability is not just about CPU (processing power). For a successful scalable web application, all layers have to scale in equally. Which includes the storage layer (Clustered file systems, s3, etc.), the database layer (partitioning, federation), application layer (memcached, scaleout, terracota, tomcat clustering, etc.), the web layer, loadbalancer, firewall, etc. For example if you don’t have a way to implement multiple load balancers to handle your future web traffic load, it doesn’t really matter how much money and effort you put into horizontal scalability of the web layer. Your traffic will be limited to only what your load balancer can push.

Choosing the right kind of scalability depends on how much you want to scale and spend. In fact if someone says there is a “one size fits all” solution, don’t believe them. And if someone starts a “scalability” discussion in the next party you attend, please do ask them what they mean by scalability first.

What do we want to achieve in a scalable system?

Scalability (duh!)

A scalable system should be prepared for a lot more workloads in the future. We can upgrade the servers to larger ones, with more CPUs and memory. We can design the system so that it can be extended by adding more servers to the existing application cluster. There should always be a scaling strategy so that the system can adapt to the upcoming extra workloads.

Robustness

A scalable system should always be responsive and function correctly, even when the number of requests grows by a factor of thousands. After all, there’s no point adding more hardware resources if the system cannot function correctly.

High availability

The system is going to serve millions or billions of users all around the world. Lots of businesses may depend on our system. Our system, therefore, cannot afford any down-time. The system should always be available, even during system upgrades. When our application goes global, there’s no place for “night deploys”.

 

3. Methodologies

Although there are a lot of specific ways to scale a system, they can be generalized into some methodologies below.

Methodology #1: Splitting the system

If you can’t split it, you can’t scale it. — Randy Shoup

Splitting is one of the most common practices in designing a scalable system. The idea is that, since vertical scaling the whole system is limited by hardware capability, we need to split the system into components and scale each component separately.

For example, let’s say we are designing an e-commerce system. At first, we have our web application and our database on the same server. When the traffic grows, we plan to buy a larger server. If we put our system on the most powerful server available, it can handle up to one million concurrent users. And that’s it. We cannot scale to another million users because there’s no more powerful server that we can buy.

The first thing we can do is to split the system so that the web application is put on one server and the database on another.
Then, we can clone the web application to put on multiple servers, all accessing the same database server.
Then, we can split the database into multiple databases, each containing several tables from the original database. The sub-databases can now be put on separate servers. In an e-commerce system, the database can be split into a product database, an order database, a fulfillment-process database, a user and authentication database, etc.

Of course, we cannot split things that easily if we didn’t design our system for that from the beginning. For example, if our application joins data from two tables, these two tables cannot be split into different servers. This little example can show us the importance of designing a system for scalability from the early days.

HDFS, MapReduce, Kafka, ElasticSearch, and many more applications are designed to be able to split and scale by adding more servers to the application cluster.
Facebook split their databases not only by tables, but also by rows. Data of users in each region are saved on different “region databases”, and are synced periodically to other “region databases”.
Lots of large systems nowadays are split into microservices, each of which takes care of one function in the system, so that the services can be scaled separately.

As you can see, designing ways that our system can be split plays an important role in making our system scalable.

Methodology #2: Detecting and Optimizing Bottlenecks

The limit of a system is the limit of its weakest link.

To make the system handle more workloads, we need to find the system’s weakest point and make that point handle more workloads.

Let’s think of an example. In our e-commerce system, we have 5 web servers and 1 database server, each hosted on a separate physical server instance. The web servers are running at about 5% of CPU on average, while the database server is always running at 95% of CPU. The bottleneck of the system in this case is the database server.

In the above example, there’s no point adding more web servers to the system. If the current setup can handle one million concurrent users at most, it is not likely that adding more web servers can help the system to handle more users. Instead, optimizing the database server may help.

As discussed in the previous part, to optimize the bottleneck at the database server, we can buy a more powerful server and relocate the database to it. If that is not an option, we can try to split the tables of the database into several sub-databases on different server instances, but that would require some code modification and may not be an option either.

Taking a closer look at the resource usage on the database server, we find out that most of the time the CPUs are not doing computation, but are instead waiting for I/O requests to complete. We monitor the disk I/O, only to find out that the disk writes are always at 100 MB/s. Now we know that the real bottleneck is the disk I/O.

To optimize the disk I/O bottleneck, we can upgrade our HDDs to SSDs, add more disks to the RAID system, or try to use a SAN. As long as we can provide better I/O bandwidth, the whole system may benefit.

On the other hand, we can reduce the I/O requests by optimizing database queries and indexes. If, after creating some database indexes, the disk I/O rate drops to 10 MB/s, we may not need to upgrade the database server at all. Many times in my past projects, the real reason was that the database didn’t have enough memory to cache the queries; strange as it may sound, adding more memory solved the disk I/O problems.

After optimizing the database, our system can handle another million users, but looking at the resource usage, we now see that the web servers are using 99% of CPU all the time, while the database uses less than 10%. This time, the web servers have become the bottleneck. We can repeat the optimization process with the web servers: add more server instances, or detect and optimize the bad code blocks that are causing the rise in CPU usage. The point is that if the weakest link in the system can handle more requests, the whole system can handle more.

Methodology #3: Detecting and Eliminating Single Point of Failure (SPOF)

Since we mentioned bottlenecks in the previous part, I thought it’s worth discussing Single Points of Failure too. This part is more about keeping our system highly available than about enabling it to handle more requests. In fact, most large systems serve a lot of people and lots of businesses may depend on them, so high availability is one of the most important requirements.

From Wikipedia:

A single point of failure (SPOF) is a part of a system that, if it fails, will stop the entire system from working. SPOFs are undesirable in any system with a goal of high availability or reliability, be it a business practice, software application, or other industrial system.

In a traditional web application, we often have a web server reading from and writing to a database. When a user opens a browser and navigates to the website:

  • the browser sends a request to the web server,
  • the web server receives the request and reads data from the database or writes to it,
  • the web server responds to the browser with the result,
  • the browser renders the response to the screen.

In the above setup, if the web server breaks down (maybe due to hardware issues), the website is down. The user cannot connect to the website anymore. This web server is a Single Point of Failure, which means if it fails, the whole system fails.
The database server in this case is also a Single Point of Failure.

To make our system highly available, which means the system can still function if some part of it goes down, we have to eliminate its Single Points of Failure.
The word “eliminate” doesn’t mean taking that part down, but rather making that part no longer a Single Point of Failure.

Back to our example: to eliminate the Single Point of Failure at the database, we can use the mirroring function of the database. We set up the database on 2 separate server instances, one as the master server and the other as the mirror server. If the master goes down, the mirror server steps up to replace the master, so the web servers can still access the database.
For the web server, we set up another web server that functions exactly the same as the first one. We set up a reverse proxy to load-balance requests between the two web servers. If a web server breaks down, the reverse proxy detects this and routes all traffic to the remaining one.

We have eliminated two single points of failure in the system. However, we are introducing a new one: the reverse proxy.
In the new setup, the browser connects to the reverse proxy, which forwards the request to the internal web server, waits for the response, then forwards it back to the browser. If the reverse proxy goes down, the user still cannot access the website.

To eliminate this new single point of failure, we can set up a backup server for the reverse proxy and use a Virtual IP Address. The two reverse proxy servers continuously check whether the other is alive, and make sure that one of them holds the Virtual IP Address. When the master reverse proxy goes down, the backup server takes over the Virtual IP Address and takes the job from the master.

Detecting and eliminating Single Point of Failure is no easy task in systems design. The example above is just a simple one to demonstrate the idea. We’ll have a whole blog on this later.

Methodology #4: Caching

I believe you’ve heard about caching and use it a lot in your projects. Let’s again look it up on Wikipedia:

In computing, a cache /ˈkæʃ/ KASH, is a hardware or software component that stores data so future requests for that data can be served faster; the data stored in a cache might be the result of an earlier computation, or the duplicate of data stored elsewhere.

A cache hit occurs when the requested data can be found in a cache, while a cache miss occurs when it cannot. Cache hits are served by reading data from the cache, which is faster than recomputing a result or reading from a slower data store; thus, the more requests can be served from the cache, the faster the system performs.

If your system has a lot of data that doesn’t change for short periods of time, or if a change isn’t critical and it doesn’t hurt to serve the user an old version of the data, caching is a good candidate that can speed up your system by ten or even a hundred times.
For example, a news website doesn’t need to change its news every second. People can read a 5-minute-old version of the news without any critical problems.

There are many types of caching, and all serve the same purpose: to make future requests for that data faster. A caching solution can be a mix of the following caching strategies:

Caching in disk
After being computed for the first time, a web page’s content may be cached into a file in the server’s storage. The next time the web page is requested, the server does not have to recompute the content, but reads it directly from the cached file and responds to the user. Most modern web servers (Apache, Nginx, IIS) and web frameworks (Django, PHP, .NET MVC) support this type of caching.
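
As an illustration, here is a minimal sketch of the idea in Python (not from any particular framework; the cache directory and the render_page function are hypothetical placeholders for your expensive rendering logic):

import os

CACHE_DIR = '/tmp/page_cache'  # hypothetical cache location

def get_page(page_id):
    path = os.path.join(CACHE_DIR, '%s.html' % page_id)
    if os.path.exists(path):               # cache hit: serve the stored file
        with open(path) as f:
            return f.read()
    html = render_page(page_id)            # cache miss: expensive rendering (hypothetical helper)
    os.makedirs(CACHE_DIR, exist_ok=True)
    with open(path, 'w') as f:             # store the result for the next request
        f.write(html)
    return html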

Caching in memory
After the data is read from disk or computed from the database, it is cached in memory so that it can be read much faster on subsequent requests. This type of caching is often used to cache data objects, and is also used by image or video hosting servers.

Caching objects
Instead of caching the whole web page content, the system can cache objects that were read from the database in memory, so that next time it doesn’t have to query the database again. This type of caching is often used in large systems where data is read much more often than it is written.

Caching in database
Computations can also be cached in a database. If the application does a lot of aggregations on raw data, and the aggregations do not need to be 100% up to date every time they are requested, we can ease the load on the database by precomputing the aggregations and caching them in a separate database table, instead of scanning the whole raw-data table to compute the aggregation every time a request comes in.
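
A minimal sketch of the idea, using Python’s built-in sqlite3 and hypothetical table names (the orders table is assumed to exist; any relational database works the same way):

import sqlite3

conn = sqlite3.connect('shop.db')  # hypothetical database file

# Run this periodically (e.g. from a cron job) instead of aggregating
# the raw table on every request.
conn.executescript("""
    CREATE TABLE IF NOT EXISTS daily_sales_summary (day TEXT PRIMARY KEY, total REAL);
    DELETE FROM daily_sales_summary;
    INSERT INTO daily_sales_summary (day, total)
        SELECT date(created_at), SUM(amount) FROM orders GROUP BY date(created_at);
""")
conn.commit()

# Request handlers now read the small precomputed table instead of the raw data.
rows = conn.execute("SELECT day, total FROM daily_sales_summary ORDER BY day").fetchall()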

Distributed caching
If we need to share cached data among web servers, we may need a distributed caching service of some kind to store the cached data. For example, suppose we cache users’ session data in the web servers’ memory and use a load balancer to round-robin requests among these web servers. We may then face the situation where a user logs in on web server 1, so the session data is stored in web server 1’s memory. Later, when that user refreshes the page, she gets routed to web server 2, which has no session data for her. The result is that the user appears to be logged out, although she just logged in 5 seconds ago. To overcome this situation, we need a distributed caching solution. It can be a network shared folder to keep the session file, or it can be a distributed memory caching solution like memcached or Redis.
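
To make the session example concrete, here is a minimal sketch (assuming the redis-py client and a Redis server on localhost; names and TTL are illustrative) where every web server reads and writes session data through the shared cache instead of its own memory:

import json
import redis

r = redis.Redis(host='localhost', port=6379)  # shared by all web servers

def save_session(session_id, data, ttl_seconds=1800):
    # Any web server can write the session...
    r.setex('session:' + session_id, ttl_seconds, json.dumps(data))

def load_session(session_id):
    # ...and any other web server can read it back on the next request.
    raw = r.get('session:' + session_id)
    return json.loads(raw) if raw else None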

Caching can be powerful but should be used with care. Improper use of caching may cause serious problems. Here are some examples:

  • Caching account balances in a credit system is not the smartest thing to do, because it can lead to situations where accounts are overcharged.
  • Another common mistake is caching web page responses on a reverse proxy, including the response header information. It may happen like this:
    • For example, Alice goes to myshop.com, logs in, then browses the detail page of product A.
    • The web server renders the web content for product A and caches it as an HTML file on the server’s disk storage, including the response header that sets Alice’s authentication cookie.
    • Meanwhile, Bob also wants to browse product A.
    • The web server serves Bob the web content from the cached file, including Alice’s authentication cookie in the response header.
    • Bob is now seeing product A, and is unintentionally logged in as Alice.
    • Bob can now see Alice’s order history. If Bob buys something on the website, the system may record that Alice bought it. Later, a customer service agent calls Alice to confirm the order, but Alice doesn’t understand what happened.

The second example may sound stupid, but it did happen in one of my past projects. The reverse proxy application at that time somehow cached some URLs that it was not configured to cache, leading to the situation described above. Turning off the proxy’s “kernel caching mode”, without modifying any URL configuration, made the problem disappear.

There’s a lot more about caching, but that would be out of the scope of this blog. We’ll have another blog on this topic.

Methodology #5: Indexing

Indexing is a way of storing data in a suitable structure, so that data retrieval can be fast and accurate.

Database index

Sometimes, caching is not applicable due to the nature of the business, like in a banking system, where transactional data must always be consistent.

Database queries can be optimized by adding a database index to the table. A database index can improve query performance by a factor of thousands to millions. In fact, the query running time can drop from O(n) for a full table scan down to O(log n) for an indexed table, where n is the number of records in the table. Say we have a well-indexed table of 10,000,000,000 (ten billion) records: the query needs only on the order of tens of comparisons to find the matching row. Behind the scenes, the indexed data is stored in a B-tree data structure, but that’s out of the scope of this blog.

Database indexing not only speeds up the query running time, but also reduces the disk I/O needed to return the matching records. In OLTP databases, faster queries lead to less locking time on the table.
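
A quick way to see the effect is a minimal sketch using Python’s built-in sqlite3 (the table and data below are made up; the same idea applies to any relational database):

import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute("CREATE TABLE entries (id INTEGER PRIMARY KEY, station TEXT, hourly INTEGER)")
conn.executemany("INSERT INTO entries (station, hourly) VALUES (?, ?)",
                 [('unit%d' % (i % 500), i) for i in range(100000)])

# Without an index, this query scans the whole table...
print(conn.execute("EXPLAIN QUERY PLAN SELECT * FROM entries WHERE station = 'unit42'").fetchall())

# ...after adding an index, SQLite searches the B-tree instead.
conn.execute("CREATE INDEX idx_entries_station ON entries (station)")
print(conn.execute("EXPLAIN QUERY PLAN SELECT * FROM entries WHERE station = 'unit42'").fetchall())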

I’ve seen it many times in my past projects: an insert/update to the database took too long to complete, but the real cause was not the insert/update itself. The real cause was another select query that took too long to complete and locked the whole table during its execution, making the insert/update queue up behind it. After adding an index to optimize the select query, the problem was gone: the insert/update could complete and return immediately as usual.

If you want to learn more about database index, try reading use-the-index-luke.

Search index

When it comes to searching, there’s another hero in town: search index.

Most databases support full-text indexes out of the box, but they are hard to configure and do not scale very well. Search queries can be very complicated, like searching for a product with a keyword that should appear in its title or content, where the product must be in the sports and clothes categories, with a discount of at least 50 percent. The database could be used to fulfill such a search, but the performance would be terrible.

At the moment, Solr and ElasticSearch are the most widely used search engines. They are fast, horizontally scalable, and good at full-text search and handling complicated queries.

Using a search engine in our system can yield several benefits:

  • Search engines take care of what they are best at (searching), while leaving the database to do what it is best at: storing transactional data.
  • Most search engines are designed to scale very well horizontally. Therefore, by using a search engine, our system’s search function is already prepared for scaling. For example, ElasticSearch can be deployed as a cluster of multiple nodes. When we need extra performance from the cluster, we can add more nodes, or we can simply increase the number of replicas of the index, all of which can be done online without shutting down the service.

    Increasing the number of replicas lets more nodes store the same piece of data, and hence ElasticSearch can load-balance queries across more nodes, which increases query performance.

    Think of what would happen if we did the search directly in the database. The scaling would be a disaster.

With all the benefits mentioned above, using a search engine does give you some extra work to do, like indexing or synchronizing the data from the database to the search engine. But if your system is going to scale, the benefits are going to be worth the extra effort.

Methodology #6: Classifying and Prioritizing Data

Not all data are created equal.

When our system grows too large and too complicated, we may not be able to work out a way to scale all of our data together. Facebook has more than a billion users, each user has hundreds to thousands of friends, and each friend has several status updates and activities every day. If, every time a user created a status, we had to notify all of the friends about that new activity in one database transaction, no system would be able to handle the workload. And even if it could, the user would have to wait a very long time before his or her status post completed, just because he or she has more than a thousand friends to update along the way.

Real-time vs. Near real-time vs. Offline

During a Facebook developer event, while talking about how Facebook partitioned the users into multiple databases, each database containing a subset of users, and synchronized the “activity feeds” among databases, the speaker was asked: “How did Facebook keep the data consistent among databases?”. At that time, the answer blew my mind: “Fortunately, we don’t have to be consistent at all”.

  • In Facebook’s case, of course it’s good to notify the friends about a user’s new status right away, but at the same time it doesn’t hurt if the notification arrives 5 minutes or even an hour later, not to mention that half of those friends may not even be online by that time.
  • In an e-commerce system, most of the reports don’t need to be real-time. Some of the reports can even be updated weekly or monthly (yes, I’m talking about those weekly and monthly sales performance reports :D)

Before designing a solution to scale the system, we should first classify each data into one of these types: real-time, near real-time, and offline.

  • Real-time: This data always needs to be consistent. In other words, every time data of this type is read, it must be the latest and the only version of the data. Data like bank account balances, warehouse stock quantities, GPS location in a navigation system, and chat messages falls into this category.
  • Near real-time: This data can afford to be a little bit late, for example five minutes or even one hour. For example, in most cases, emails can be a little bit late without hurting anyone. The activity feed from the section above can also be classified as near real-time. Of course, it’s always better if this data can be real-time, but if the cost of being real-time is too high, it can fall back to near real-time with minimal to zero negative impact. In fact, most people don’t care about or even notice a delay in activity notifications.
  • Offline: This data does not need to be updated regularly. It can be updated in batches at the end of the day or at the end of the week, when the system is not heavily used. For example, in an e-commerce system, reconciliation reports can be exported once a day at night, when the traffic to the website is not that critical and the system resources are free.

Knowing the priority of our data, we can decide how each kind of data should be stored and scaled. Real-time data should be stored in a transactional database, while near real-time data can be put in a queue to be processed with a small delay. Offline data can be put in a replicated database, which is synchronized once a day during low-traffic hours. The system can then use significantly fewer resources while still fulfilling the business requirements.

Read-heavy vs. Write-heavy

Besides the classification above, we should also take into consideration whether our data is read-heavy or write-heavy.

  • Read-heavy: This data is read a lot more frequently than it is written or updated. For example, articles in a newspaper or a blog are rarely updated, but are read very frequently. For this type of data, we can use caching or replication to enable the system to read more in less time.
  • Write-heavy: This data is written a lot more frequently than it is read, for example access logs in a system or bank transactions. For this type of data, caching or replication may not be helpful. In fact, they may even hurt the system performance, since the system has to do more work every time it writes something, while it rarely reads the data from the cache. The data may have been changed a hundred times before it is read from the cache once.

Other classifications

The above are just two ways of classifying and prioritizing data before designing an appropriate scaling strategy for each type of data. In practice, we can think of more ways to classify the data. The point is, we should classify the data based on business needs and select an appropriate strategy so that the system can use fewer resources and still meet the business requirements.

 

4. Where to go from here?

The above topics are just some of the most basic topics in designing a scalable system. There are a lot of things that I haven’t learned or haven’t even heard of. I’m going to list some topics that can be helpful when designing a scaling strategy; hopefully some of them will be helpful to you. The more we know, the higher the chance we can find a good scaling solution for our system. The topics below are not listed in any particular order.

  • CAP theorem
  • How to become horizontally scalable in every layer
  • Vertical partitioning vs. horizontal partitioning
  • Read-heavy vs. write heavy
  • Clustering: partitioning vs. replication
  • Real-time vs. near real-time vs offline
  • How to make web server horizontally scalable using reverse proxy
  • How to make reverse proxy horizontally scalable
  • How to make database horizontally scalable
    • Shared nothing database cluster vs. shared storage database cluster
    • MySQL NDB vs. Percona vs. Oracle Database Cluster vs. SQL Server Cluster
  • Using cloud database service vs. scaling self-hosted database
  • Scaling system on cloud hosted environment vs. self hosted environment
  • DNS round robin
  • Caching
    • Disk caching
    • Local memory caching
    • Distributed memory caching
      • memcached
      • Redis
  • Hardware scaling
    • RAID storage
    • SAN storage
    • Fiber network interface
  • Network capacity consideration
    • Local cache vs. network distributed cache
    • local pre-compute to reduce network traffic (e.g. MapReduce Combiner)
  • Storage scaling
    • Increasing reading bandwidth (RAID, replication, memory caching, distributed network storage, hdfs, etc.)
    • Increasing writing bandwidth (RAID, partitioning, memory write buffer, distributed network storage, hdfs, etc.)
    • Scaling storage capacity (RAID, distributed network storage, data compression, etc.)
  • Database/datawarehouse optimization
    • Database index
    • Column store index vs. row store index
  • Search optimization (ElasticSearch, Solr)
  • Peak-time preparation strategies (cloud vs. self-hosting, AWS Auto Scaling, Google Cloud Autoscaler)
  • Cost optimization
    • Google Cloud Preemptible, AWS Spot Instance
    • Free CDN: CloudFlare, Incapsula
  • Resources monitoring, debugging, troubleshooting
  • Automate everything
  • Load test vs. stress test

 

Conclusion

When it comes to scaling, there’s no magical solution that can tackle all problems on all systems. Knowing the basics and the methodologies, we can analyze the situation, find the most suitable solution, and prioritize what can be done first to achieve the biggest impact.

On the other hand, there’s no final destination in optimizing a system’s scalability. Whenever we sense that our system is reaching its limit, we should work out a better solution to scale it before it’s too late. However, if there’s no sign that the system will get overloaded anytime soon, I believe you’ll always have better work to do than further optimizing the system. If there are only 10 billion people on earth, there’s no point designing a system for 100 billion concurrent users.

This article can in no way cover everything about designing scalable systems, but I hope some of it might be helpful to you. What’s the most interesting experience you’ve had scaling your system? Share it with us in the comments! 😀

Sorry for the long post!

Javascript Promises – Part III – Chaining Promise’s

In the previous parts, we have learned about Promise and Promise error handling.

In this part, we are going to learn about chaining Promise‘s.

Let’s take a look at the following code:

var dogs = ['dog1', 'dog2', 'dog3', 'dog4', 'dog5', 'dog6'];
function printName (name) {
    return new Promise(function(resolve, reject) {
        console.log('printName: ' + name);
        resolve(name);
    });
}

function displayNames(names) {
    var currentName = names.shift();        // take one name out to process
    if (currentName) {                      // if not end of array
        printName(currentName).then(function(name) {
            console.log('printName succeeded: ' + name);
            displayNames(names);
        }).catch(function(name) {
            console.log('printName failed: ' + name);
            displayNames(names);
        });
        
    }
}

displayNames(dogs);

In the example above, we are trying to print the list of dogs’ names in sequential order, that is, one name after another, not all at the same time.

As you already know, in Javascript, a function returns immediately after it is called, and a moment later the callback function is called when the job actually finishes. This leads to a situation where we can’t easily do a list of jobs one after another.

Promises give us the power to do just what we want by chaining promises together.
When promises are chained together, the Javascript engine will only start the second promise’s job when the first promise has been either resolved or rejected. That way the promises’ jobs are executed one after another.

In the example above, we are using recursion to make the code run in sequence. However, using promises, we can make the code simpler and more extensible.
Let’s look at the easy version first:

var dogs = ['dog1', 'dog2', 'dog3', 'dog4', 'dog5', 'dog6'];
function printName (name) {
    return new Promise(function(resolve, reject) {
        console.log('printName: ' + name);
        resolve(name);
    });
}

printName('dog1')
    .then(function(name){
        console.log('printName succeeded: ' + name);
        printName('dog2').then(function(name){
            console.log('printName succeeded: ' + name);
        });
    });

Instead of writing code like this, we can leverage the power of chaining Promise and rewrite the code as below:

var dogs = ['dog1', 'dog2', 'dog3', 'dog4', 'dog5', 'dog6'];
function printName (name) {
    return new Promise(function(resolve, reject) {
        console.log('printName: ' + name);
        resolve(name);
    });
}

printName('dog1')
    .then(function(name){
        console.log('printName succeeded: ' + name);
        return printName('dog2');
    })
    .then(function(name){
        console.log('printName succeeded: ' + name);
    });

By chaining the promises like above, we are able to print dog2 after dog1. The key point is that, instead of passing the resolve and reject functions to printName('dog2').then(), we can just return printName('dog2') as a Promise, and the next .then() does its job in exactly the same manner as if it had been passed to printName('dog2').then(). This works because Promise provides the chaining syntax to make the code more readable.

Imagine that instead of creating a chain like this p1 -> (p2 -> (p3 -> p4)), we can now create a chain like p1 -> p2 -> p3 -> p4, which functions exactly the same.

But what if we have a lot of dogs? What if the dog list is dynamically generated? We just can’t know in advance which dogs are there to print. So we must work out a way to build the promise chain dynamically.

Let’s look at the upgraded version:

var dogs = ['dog1', 'dog2', 'dog3', 'dog4', 'dog5', 'dog6'];
function printName (name) {
    return new Promise(function(resolve, reject) {
        console.log('printName: ' + name);
        resolve(name);
    });
}

var sequence = Promise.resolve();

dogs.forEach(function(name) {
    sequence = sequence.then(function() {
        return printName(name);
    }).then(function(name) {
        console.log('printName succeeded: ' + name);
    }).catch(function() {
        console.log('printName failed: ' + name);
    })
})

In the code above, we use a very interesting trick by writing var sequence = Promise.resolve();, which creates an empty, already-resolved promise to use as the beginning of the chain. After this, we can add the next promise to the chain easily just by calling sequence.then().

Another thing worth mentioning here is that if we use for instead of forEach, we have to be careful, because the index variable will not hold the intended value by the time the promise’s resolve function is called.
The code should look like this:

var dogs = ['dog1', 'dog2', 'dog3', 'dog4', 'dog5', 'dog6'];
function printName (name) {
    return new Promise(function(resolve, reject) {
        console.log('printName: ' + name);
        resolve(name);
    });
}

var sequence = Promise.resolve();

for (var i = 0; i < dogs.length; i++) {
    (function() {       // define closure to capture i at each step of loop
        var capturedIndex = i;
        sequence = sequence.then(function() {
            return printName(dogs[capturedIndex]);
        }).then(function(name) {
            console.log('printName succeeded: ' + name);
        }).catch(function() {
            console.log('printName failed: ' + dogs[capturedIndex]);
        })
    }())    // invoke closure function immediately
}

Creating an array of promises

Instead of chaining promises together, we can also create an array of promises and wait for all the promises to finish by using Promise.all() like this:

function printName (name) {
    return new Promise(function(resolve, reject) {
        console.log('printName: ' + name);
        resolve(name);
    });
}

var promiseArray = [ printName('dog1'), printName('dog2')];

Promise.all(promiseArray).then(function(names){
    console.log(names);
}).catch(function(names){
    console.log(names);
})

Promise.all waits for all the promises in the array to finish. If every promise resolves, the final resolve function is called with an array of the resolved values, in the same order as the promise array. If any promise rejects, Promise.all rejects immediately, and the final reject function (the .catch() handler) is called with the value of the first rejected promise. In other words, if the promise array has 5 promises and one of them fails, the .catch() handler is called with that single failure value, and the final resolve function is not called at all.

Congratulations, you’ve made it!
I hope you understand what promise is and how it works now.

If my explanation doesn’t seem to work for you, you might want to try reading this very good tutorial here.

If you have any thoughts or questions, please share your comment below.

Cheers!

Javascript Promises – Part II – Handling errors

In the previous article, we’ve learned what Promise is and went through some basic use cases of Promise. In this article, we will learn how to handle errors while using Promise.

If you haven’t read part I, I would recommend that you take a look at it here.

If you’ve read part I, you may have wondered when the reject function would be called and how to specify a real reject function to the Promise.

Actually, in the previous part, to keep things simple, we ignored the reject function. The full version of .then() should come with 2 parameters: a resolve function and a reject function.

Reject function

To be clearer on this, let’s open our cool console tab in chrome and type the following code

var p1 = new Promise(function(resolve, reject) {
    console.log('1');
    throw 'Uh-oh!';
});

p1;

The console outputs would be like below:

Promise { [[PromiseStatus]]: "rejected", [[PromiseValue]]: "Uh-oh!" }

Ok, something has changed. p1‘s status is now "rejected" and its PromiseValue is now "Uh-oh!".
What happened is that during its execution, p1 hit an error (exception) and could not be resolved, so it was rejected. Thus its status is "rejected" and the exception is saved in the PromiseValue, so that it can be passed to the “reject function” if one is specified.

Now let’s try to specify the resolve function as we did in the previous part:

var p2 = p1.then(function(val) {
    console.log('2: ' + val);
    return val;
});

p2;

The console outputs would be:

Promise { [[PromiseStatus]]: "rejected", [[PromiseValue]]: "Uh-oh!" }

p2‘s status is also "rejected" and the PromiseValue is forwarded to p2. Moreover, we can see that the “resolve function” was not called, since there was no console log output.

Let’s change p2 a little bit by adding a “reject function” to .then():

var p2 = p1.then(
    function(val) {
        console.log('resolve 1: ' + val);
        return val;
    },
    function(err) {
        console.log('reject 1: ' + err);
        return err;
    }
);

p2;

The console outputs would now be:

Promise { [[PromiseStatus]]: "resolved", [[PromiseValue]]: "Uh-oh!" }

p2 is now "resolved", the console logs "reject 1: Uh-oh!", and the PromiseValue is "Uh-oh!", because we called return err; in the reject function.
In the code above, we pass 2 params to .then(): the first one is the “resolve function” to call when p1 succeeds, the second one is the “reject function” to call when p1 fails. Since p1 was rejected, the reject function is called. There’s no error during the execution of p2, so p2‘s status is resolved.
Now if we run the following code:

var p3 = p2.then(
    function(val) {
        console.log('resolve 2: ' + val);
        return val;
    },
    function(err) {
        console.log('reject 2: ' + err);
        return err;
    }
);

p3;

The console outputs would be as below:

Promise { [[PromiseStatus]]: "resolved", [[PromiseValue]]: "Uh-oh!" }

We can see that the “resolve function” was called and p3 is resolved.

then() and catch()

To make life even simpler and code even clearer, Promise provides us with another helpful function: catch()
Instead of writing this:

var p2 = p1.then(
    function(val) {
        console.log('resolve 1: ' + val);
        return val;
    },
    function(err) {
        console.log('reject 1: ' + err);
        return err;
    }
);

p2;

We can now write this:

var p2 = p1
    .then(function(val) {
        console.log('resolve 1: ' + val);
        return val;
    })
    .catch(function(err) {
         console.log('reject 1: ' + err);
         return err;
    })

p2;

Which will basically execute and output the same thing:

Promise { [[PromiseStatus]]: "resolved", [[PromiseValue]]: "Uh-oh!" }

Technically, calling .catch(rejectFunction) is equal to calling .then(undefined, rejectFunction), and in the above code there was actually another Promise in between, right after the first .then(). So the above code is actually:

var p1_1 = p1
    .then(function(val) {
        console.log('resolve 1: ' + val);
        return val;
    });

p1_1;

var p2 = p1_1.catch(function(err) {
    console.log('reject 1: ' + err);
    return err;
})

p2;

One good thing about Promise is that if an error is not handled by a promise, the promise becomes rejected and the error keeps being forwarded to the next promise. Therefore, in the above code, p1 does not have a reject function, so p1_1‘s status is rejected and the error gets passed to the next promise. Since p1_1 has a reject function (specified by .catch()), p2 is resolved.

If we don’t use .catch() and instead specify the reject function inside a .then(), then we must specify the next nested layer of resolve and reject functions inside that first reject function, and if this keeps going on, we end up in the same callback hell as when we don’t use Promise at all.

By using .catch(), the error handling would also be much simpler when we chain promise after promise like this:

var p2 = p1
    .then(resolveFunc1)
    .then(resolveFunc2)
    .then(resolveFunc3)
    .then(resolveFunc4)
    .catch(rejectFunc);

When we write it like this, if an error happens at any node of the promise chain, rejectFunc will be called.
We can also catch the error for a particular promise separately, like this:

var p2 = p1
    .then(resolveFunc1)
    .then(resolveFunc2)
    .catch(rejectFunc2)
    .then(resolveFunc3)
    .then(resolveFunc4)
    .catch(rejectFunc);

In this part, you’ve learned how to handle errors using the reject function and .catch(). We will discuss more about Promise chains in the next part of this blog series. That’s it for part II.

Next in series: Javascript Promises – Part III – Chaining Promise’s

Javascript Promises – Part I – Getting started

Have you ever seen those javascript codes that keep adding .then() and .then() all over the place, and you wonder what the hell is happening? You are not alone, my friend. Today I’m going to talk about these things called Promise, and I tell you what: after you understand this Promise thing, you’ll find it one of the coolest things Javascript has ever made.

Who should read this article?

This article applies to both javascript on the browser and javascript in Node.JS. So anyone who uses either of those may find this article helpful.

Why Promise?

Back in history, Javascript was all about callbacks. When we tell the code to do some task, we don’t wait for it to finish; instead, we pass a callback to it so that the code can call the callback when the task is done.
Now if you want the callback above to do another task and then do something with the result from that second task, you add another callback inside the previous callback. And when this adds up, you may get 10 layers of callbacks in your code and you can’t even understand your own code flow anymore. This situation is usually referred to as “callback hell”.

Getting started

Ok, let’s get down to business. At the heart of JavaScript Promise is the Promise constructor function, which looks like this:

var mypromise = new Promise(function(resolve, reject){
    // asynchronous code to run here
    // call resolve() to indicate task successfully completed
    // call reject() to indicate task has failed 
});

We can think of it this way: each Promise has a “main” function, which is the code that actually does the job. This “main” function takes two parameters: a resolve function, which main will call when the job is successfully done, and a reject function, which main will call if there is any error while executing the job.

Let’s see the whole thing in action to better understand it.

Introduction to Promise

This is how a code with Promise might look like:

function getImage(url){
    return new Promise(function(resolve, reject){
        var img = new Image()
        img.onload = function(){
            resolve(url);
        };
        img.onerror = function(){
            reject(url);
        };
        img.src = url;
    });
}

What just happened? We’ve just created a function which returns a Promise. The Promise loads an image; if the loading succeeds, it calls resolve with url as the parameter, and if it fails, it calls reject, also with url as the parameter.

It’s normal if you haven’t quite got the idea yet. We’ll explain in more detail in the next several paragraphs.

Now we can call this function like below:

getImage('doggy.jpg').then(function(successurl){
    console.log('successfully loaded: ' + successurl);
});

Let’s first understand what just happened.
After getImage returns a Promise, we provide that Promise with a resolve function. The Javascript engine will then pass the url to the resolve function, as stated in the main function of the Promise:

img.onload = function(){
    resolve(url);
};

Now that we have a rough idea of what a Promise looks like, let’s take a look at a simpler example.

The easiest way to get a visualization on this is by opening a Google Chrome browser and open Developer Tools panel (usually by pressing F12 on Windows, or Command-Alt-I on Mac, or from the Menu -> More tools -> Developer tools). Navigate to console tab and let’s do the business.

Ok you don’t need incognito mode. I just open incognito mode out of habit, if you know what I mean ;))

Creating your first Promise

Paste this in your chrome’s console:

var p1 = new Promise(function(resolve, reject) {
    console.log('1');
    return 'value 1';
});

p1;

We’ve just created a Promise named p1 and then printed it to the console to see its status, which results in something like this in the console:

Promise { [[PromiseStatus]]: "pending", [[PromiseValue]]: undefined }

So p1‘s status is pending, and no PromiseValue has been set yet. That is because we haven’t called resolve nor reject while executing the main function.
Let’s change p1 a little bit:

var p1 = new Promise(function(resolve, reject) {
    console.log('1');
    resolve('value 1');
});

p1;

Notice that this time we’ve changed return 'value 1'; in the previous code block to resolve('value 1');. The console now returns with:

Promise { [[PromiseStatus]]: "resolved", [[PromiseValue]]: "value 1" }

Yay! The promise is now resolved, and PromiseValue is now "value 1".
Now let’s type this into your console:

var p2 = p1.then(function(val) {
    console.log('2: ' + val);
    return val;
});

p2;

The console returns with:

Promise { [[PromiseStatus]]: "resolved", [[PromiseValue]]: "value 1" }

What just happened? By calling p1.then, we specify the resolve function for p1. So now, when the main function of p1 completes, it knows which resolve function to call. It calls the function specified in then(), and passes p1.PromiseValue to that function as the param (in this case val).

But wait, p1 already finished before the real resolve function was passed to p1.then. How could that resolve function be called?

So, Promise is a mechanism provided by the javascript engine, and the Javascript Engine is the one that calls that resolve function. Let’s imagine that the Javascript Engine has a timer that continuously checks the status of Promise p1. When p1 finishes, it updates p1.status to either resolved or rejected, and saves the PromiseValue so that it can pass it to the resolve or reject function later on. Then it checks if a real resolve function has been specified. If no resolve function is specified, it just leaves the Promise there and rechecks in its next timer loop. Half an hour later, someone specifies the real resolve function for p1 by calling p1.then(resolveFunc). In its next timer loop, the Javascript Engine finds out that p1 now has a real resolve function, so it calls that function with p1.PromiseValue as the function’s first parameter.

Another fancy thing to notice in the previous example is that p2 is also a Promise. Technically, to return a Promise, that block of code should be rewritten as below:

var p2 = p1.then(function(val) {
    console.log('2: ' + val);
    return new Promise(function (resolve, reject) {
        resolve(val);
    })
});

By default, Promise.then() returns a Promise. Promise.then() is smart enough to check whether the passed-in function returns a plain value or a Promise. If the function returns a plain value, .then() wraps it in a Promise whose PromiseValue is that returned value. On the other hand, if the passed-in function returns a Promise, Promise.then() will just forward that Promise as its returned object.

Therefore, in the previous block of code, when .then() finds out that the passed-in function just does return val;, it wraps that function into a Promise and returns that Promise instead. Later, when that function finishes, the value returned from the function is assigned to the PromiseValue of the returned Promise (p2).

Now that p2 is a Promise, we can continue to use then() on it as below:

var p3 = p2.then(function(val) {
    console.log('3: ' + val);
    return val;
});

p3;

The console output should be like this:

Promise { [[PromiseStatus]]: "resolved", [[PromiseValue]]: "value 1" }

which means p3 is also a Promise, it has been resolved and its PromiseValue is "value 1", waiting to be passed on to the next Promise if there is any.

Ok that’s it for part I. Now you know what Promise is and what those .then() functions mean.

In the next parts, we will look more into error handling in Promise and how Promise‘s are chained.

Next in series: Javascript Promises – Part II – Handling errors