Author: Duy Truong

Machine Learning – Basic Implementation (Part I – Linear Regression)

Machine Learning – Basic Implementation (Part I – Linear Regression)


This post is a series of how to do basic implement specific machine learning algorithm to solve a problem. This implementation will help you, step-by-step, tune plenty of processes, in order, to optimise model.


You should have overview mathematical of linear regression and concept of terms: features, gradient descent, cost function …
We also assume that all data be collected and cleaned before doing any implementation


1. Normalize data
Base on the variety of data scale, we should normal data before do any visualisation . It can easy apply with python numpy library

import numpy as np
def normalize_features(df):
    Normalize the features in the data set.
    mu = df.mean()
    sigma = df.std()
    if (sigma == 0).any():
        raise Exception("One or more features had the same value for all samples, and thus could " + \
                        "not be normalized. Please do not include features with only a single value " + \
                        "in your model.")
    df_normalized = (df - df.mean()) / df.std()

    return df_normalized, mu, sigma

2. Visualize data
Of course, first steps to choose any model to train is to see how data (x,y) interact by graph. You will know linear regression to be correct algorithm if your data (or graph of your data) satisfy all criteria below:

  • The scatter of points has to be around the best-fit line and same standard deviation all long the curve. if there are many points too far (high or low) from best-line, that should be not a linear regression
  • The measurement of x (the features) should be exactly correct. Imprecision of measuring X (if happen) should be very small compared to biological variability of Y
  • The data input should be independent each other. it mean if there is a change in one data experiment, the others should not be change
  • Is a correlation between X and Y. for example: midterm score vs total score. while midterm score is a parameter (or component) to calculate total score, linear regression is not valid for these datas.

3. Gradient descent vs statmodel OLS
Before talking about tuning model, we get started with a basic step using gradient descent and statmodel OLS to find a first set of parameter theta. This set of theta might not be the best, but it provide overview step how to tune model later.

All cleaned data can be download from here: Turnstile Data of New York Subway
By gradient descent

def compute_cost(features, values, theta):

    m = len(values)
    sum_of_square_errors = np.square(, theta) - values).sum()
    cost = sum_of_square_errors / (2 * m)
    return cost

def gradient_descent(features, values, theta, alpha, num_iterations):

    m = len(values)
    cost_history = []

    for i in range(num_iterations):
        predict_v =, theta)
        theta = theta - alpha / m * - values), features)
        cost = compute_cost(features, values, theta)
    return theta, pandas.Series(cost_history)

def predictions(dataframe):

    # Select Features (try different features!) - all feature to predict ENTRIESn_hourly]
    features = dataframe[['Hour', 'maxpressurei','maxdewpti','mindewpti', 'minpressurei','meandewpti','meanpressurei', 'meanwindspdi','mintempi','meantempi', 'maxtempi','precipi']]
    dummy_units = pandas.get_dummies(dataframe['UNIT'], prefix='unit')
    features = features.join(dummy_units)

    # Values - or y in model
    values = dataframe['ENTRIESn_hourly']
    m = len(values)

    features, mu, sigma = normalize_features(features)
    # Add a column of 1s (for theta0)
    features['ones'] = np.ones(m)

    # Convert features and values to numpy arrays
    features_array = np.array(features)
    values_array = np.array(values)

    # Set values for alpha, number of iterations.
    # learning rate
    alpha = 0.1
    # # of data set want to try
    num_iterations = 15000

    # Initialize theta, perform gradient descent
    theta_gradient_descent = np.zeros(len(features.columns))
    theta_gradient_descent, cost_history = gradient_descent(features_array,

    predictions =, theta_gradient_descent)
    #coefficient of determination
    r = math.sqrt(compute_r_squared(values_array, predictions))

By stamodel OLS

def predictions(df_in):
    # select the features to use
    #feature_names = ['meantempi', 'Hour']
    feature_names = ['Hour', 'maxpressurei','maxdewpti','mindewpti', 'minpressurei','meandewpti','meanpressurei', 'meanwindspdi','mintempi','meantempi', 'maxtempi','precipi']

    # initialize the Y values
    X = sm.add_constant(df_in[feature_names])
    Y = df_in['ENTRIESn_hourly']

    # initialize the X features by add dummy units, standardize, and add constant
    dummy_units = pd.get_dummies(df_in['UNIT'], prefix='unit')
    X = df_in[feature_names].join(dummy_units)
    X, mu, sigma = normalize_features(X)

    # add constant in model will improve a little bit
    X = sm.add_constant(X)

    # ordinary least squares model
    model = sm.OLS(Y, X)

    # fit the model
    results =
    prediction = results.predict(X)

    return prediction

  • Both technique required add dummy_variable to isolate categorial features. This will help to improve a lot of model
  • add constant in statmodel (mean shift Y a constant value) is not really meaningless, but just improve model little bit
  • we both use coefficient determination (R) to validate model. close to 1 mean better model

(to be continued)

HA Proxy using VIP and keepalived

HA Proxy using VIP and keepalived



This post discusses how to leverage keepalived features to proxy request(s) (both internal and external) with only 2 proxy servers, without forfeiting high availability.


2 installed CentOS with NginX server, a spare LAN IP, and a spare WAN IP from your cloud service.


1. Install keepalived (if not already present):

yum install keepalived

2. Bind IP which not defined in system (kernel level)
This step help kernel understand that a interface can have 2 ip address.
Add this to /etc/sysctl.conf:

net.ipv4.ip_nonlocal_bind = 1

Force sysctl to apply new setting:

sysctl -p

3. Configure keepalived at BOTH proxy
Edit /etc/keepalived/keepalived.conf with the content below (there is a default config file when installed, ignore it):

vrrp_sync_group VG_1 {
    group { WAN_1 }
    group { LAN_1 }

vrrp_instance WAN_1 { #master WAN
    #just a name
    state MASTER # BACKUP in other proxy 
    interface eth0
    virtual_router_id 3

    priority 90 # should be <90 in other proxy

    preempt_delay 30
    garp_master_delay 1
    advert_int 2
    authentication {
        auth_type PASS
        auth_pass yourpass
    track_interface {
    virtual_ipaddress { dev eth2

vrrp_instance LAN_1 { #backup LAN
    #just a name
    state BACKUP #MASTER in other proxy

    interface eth0
    virtual_router_id 4

    priority 80 # should be >90 in other proxy

    preempt_delay 30
    garp_master_delay 1
    advert_int 2
    authentication {
        auth_type PASS
        auth_pass yourpass
    track_interface {
    virtual_ipaddress { dev eth1

4. Set BOOTPROTO=”static” in both 4 interface.
In some cloud environment, there is a periodically restart on interface (still get same IP address), but it would not start keepalived service if it’s dynamic in WAN interface
5. set chkconfig keepalived on (auto service when booted)
6. use ip addr show <interface> (with eth1 or eth0 in both proxy to check status)


  • there are maximum 255 allowed instances on a proxy using vrrp
  • ¬†instance WAN_1 consider proxy 1 as master of WAN traffic (any request to will be proxied to web1 and web2 (based on nginx configuration of upstream, there are several load-balancing mechanism to apply (ip hash, round-robin, least-connect). if proxy 1 fail over, proxy 2 will take Master role of WAN traffic and proxied.
  • instance LAN_1 consider proxy 2 as master LAN traffic (DB request/return in this scenario – to
  • either one of proxy is fail over (completely off in all interface), another will take role Master both LAN and WAN traffic, which mean eliminate Single Point Failure
  • flow of traffic:
    • http request from user will point to VIP WAN –> e2 (proxy 1) default –> load balancing to both web server through e1 (proxy 1)
    • e1 (proxy 2) ALWAYS ready to forward load balancing http to both web server, but there is no input traffic to e2 (proxy 2) until it claims VIP WAN; therefore e1 (proxy 2) is in idle to forward http traffic to both webs
    • db request from web server will point to VIP LAN –> e1 (proxy 2) default–> load balancing to both db server through e1 (proxy 2)
    • e1 (proxy 1) ALWAYS ready to forward load balance to both db server, but there is no input db traffic to e1 (proxy 1) until it claims VIP LAN
  • Important parameter(s):
    • interface : interface to use exchange packet of vrrp protocols for every instances. should be local ethernet interface both case
    • priority: lower is BACKUP, higher is MASTER for each vrrp instance
    • dont_track_primary: we use local ethernet interface to exchange vrrp information and need another interface to healthcheck other side interface (track_interface parameter). Checking primary interface health can cause an issue due sleep state of interface (but not completely fail over)
    • virtual_ipaddress : the address should correspondence master interface take
  • Discussion: How about only e1 (proxy 1) fail over (other interfaces still work)? and solution ? Please leave your comment