Machine Learning – Basic Implementation (Part I – Linear Regression)

Machine Learning – Basic Implementation (Part I – Linear Regression)


This post is a series of how to do basic implement specific machine learning algorithm to solve a problem. This implementation will help you, step-by-step, tune plenty of processes, in order, to optimise model.


You should have overview mathematical of linear regression and concept of terms: features, gradient descent, cost function …
We also assume that all data be collected and cleaned before doing any implementation


1. Normalize data
Base on the variety of data scale, we should normal data before do any visualisation . It can easy apply with python numpy library

import numpy as np
def normalize_features(df):
    Normalize the features in the data set.
    mu = df.mean()
    sigma = df.std()
    if (sigma == 0).any():
        raise Exception("One or more features had the same value for all samples, and thus could " + \
                        "not be normalized. Please do not include features with only a single value " + \
                        "in your model.")
    df_normalized = (df - df.mean()) / df.std()

    return df_normalized, mu, sigma

2. Visualize data
Of course, first steps to choose any model to train is to see how data (x,y) interact by graph. You will know linear regression to be correct algorithm if your data (or graph of your data) satisfy all criteria below:

  • The scatter of points has to be around the best-fit line and same standard deviation all long the curve. if there are many points too far (high or low) from best-line, that should be not a linear regression
  • The measurement of x (the features) should be exactly correct. Imprecision of measuring X (if happen) should be very small compared to biological variability of Y
  • The data input should be independent each other. it mean if there is a change in one data experiment, the others should not be change
  • Is a correlation between X and Y. for example: midterm score vs total score. while midterm score is a parameter (or component) to calculate total score, linear regression is not valid for these datas.

3. Gradient descent vs statmodel OLS
Before talking about tuning model, we get started with a basic step using gradient descent and statmodel OLS to find a first set of parameter theta. This set of theta might not be the best, but it provide overview step how to tune model later.

All cleaned data can be download from here: Turnstile Data of New York Subway
By gradient descent

def compute_cost(features, values, theta):

    m = len(values)
    sum_of_square_errors = np.square(, theta) - values).sum()
    cost = sum_of_square_errors / (2 * m)
    return cost

def gradient_descent(features, values, theta, alpha, num_iterations):

    m = len(values)
    cost_history = []

    for i in range(num_iterations):
        predict_v =, theta)
        theta = theta - alpha / m * - values), features)
        cost = compute_cost(features, values, theta)
    return theta, pandas.Series(cost_history)

def predictions(dataframe):

    # Select Features (try different features!) - all feature to predict ENTRIESn_hourly]
    features = dataframe[['Hour', 'maxpressurei','maxdewpti','mindewpti', 'minpressurei','meandewpti','meanpressurei', 'meanwindspdi','mintempi','meantempi', 'maxtempi','precipi']]
    dummy_units = pandas.get_dummies(dataframe['UNIT'], prefix='unit')
    features = features.join(dummy_units)

    # Values - or y in model
    values = dataframe['ENTRIESn_hourly']
    m = len(values)

    features, mu, sigma = normalize_features(features)
    # Add a column of 1s (for theta0)
    features['ones'] = np.ones(m)

    # Convert features and values to numpy arrays
    features_array = np.array(features)
    values_array = np.array(values)

    # Set values for alpha, number of iterations.
    # learning rate
    alpha = 0.1
    # # of data set want to try
    num_iterations = 15000

    # Initialize theta, perform gradient descent
    theta_gradient_descent = np.zeros(len(features.columns))
    theta_gradient_descent, cost_history = gradient_descent(features_array,

    predictions =, theta_gradient_descent)
    #coefficient of determination
    r = math.sqrt(compute_r_squared(values_array, predictions))

By stamodel OLS

def predictions(df_in):
    # select the features to use
    #feature_names = ['meantempi', 'Hour']
    feature_names = ['Hour', 'maxpressurei','maxdewpti','mindewpti', 'minpressurei','meandewpti','meanpressurei', 'meanwindspdi','mintempi','meantempi', 'maxtempi','precipi']

    # initialize the Y values
    X = sm.add_constant(df_in[feature_names])
    Y = df_in['ENTRIESn_hourly']

    # initialize the X features by add dummy units, standardize, and add constant
    dummy_units = pd.get_dummies(df_in['UNIT'], prefix='unit')
    X = df_in[feature_names].join(dummy_units)
    X, mu, sigma = normalize_features(X)

    # add constant in model will improve a little bit
    X = sm.add_constant(X)

    # ordinary least squares model
    model = sm.OLS(Y, X)

    # fit the model
    results =
    prediction = results.predict(X)

    return prediction

  • Both technique required add dummy_variable to isolate categorial features. This will help to improve a lot of model
  • add constant in statmodel (mean shift Y a constant value) is not really meaningless, but just improve model little bit
  • we both use coefficient determination (R) to validate model. close to 1 mean better model

(to be continued)

Leave a Reply

Your email address will not be published. Required fields are marked *