Machine Learning – Basic Implementation (Part I – Linear Regression)
Abstract
This post is the first in a series on how to implement specific machine learning algorithms to solve a problem. The implementation walks you, step by step, through the processes you need to tune, in order, to optimise the model.
Prerequisites
You should have a mathematical overview of linear regression and be familiar with terms such as: features, gradient descent, cost function …
We also assume that all data has been collected and cleaned before doing any implementation.
Deployment
1. Normalize data
Because the features come in a variety of scales, we should normalize the data before doing any visualisation. This is easy to implement with the python numpy library:
import numpy as np


def normalize_features(df):
    """
    Normalize the features in the data set.
    """
    mu = df.mean()
    sigma = df.std()

    if (sigma == 0).any():
        raise Exception("One or more features had the same value for all samples, and thus could "
                        "not be normalized. Please do not include features with only a single value "
                        "in your model.")

    df_normalized = (df - mu) / sigma
    return df_normalized, mu, sigma
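For instance, a minimal usage sketch (the toy DataFrame and its column names are hypothetical, just to show the call):

import pandas as pd

# hypothetical toy DataFrame with two features on very different scales
df = pd.DataFrame({'Hour': [0, 6, 12, 18], 'meantempi': [55, 60, 72, 68]})

df_normalized, mu, sigma = normalize_features(df)
print(df_normalized)   # each column now has mean ~0 and standard deviation ~1
print(mu, sigma)       # keep mu and sigma to apply the same scaling to new data later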
2. Visualize data
Of course, the first step before choosing any model to train is to look at how the data (x, y) interact on a graph. Linear regression is the right algorithm only if your data (or the graph of your data) satisfies all of the criteria below (a small plotting sketch follows the list):
- The scatter of points should lie around the best-fit line, with roughly the same standard deviation all along the line. If many points fall too far above or below the best-fit line, linear regression is probably not appropriate.
- The measurement of X (the features) should be essentially exact. Any imprecision in measuring X should be very small compared to the variability of Y.
- The data points should be independent of each other: a change in one observation should not affect the others.
- X and Y must not be intertwined. For example, midterm score vs. total score: since the midterm score is a parameter (or component) used to calculate the total score, linear regression is not valid for such data.
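As promised above, a minimal plotting sketch to eyeball these criteria; the file name and column names here are only placeholders for whatever feature/target pair you want to inspect:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

dataframe = pd.read_csv('turnstile_data.csv')   # hypothetical file name for the cleaned data set

x = dataframe['meantempi']          # hypothetical feature column
y = dataframe['ENTRIESn_hourly']    # target column

plt.scatter(x, y, alpha=0.3, label='data')

# overlay a simple best-fit line to judge how the points scatter around it
slope, intercept = np.polyfit(x, y, 1)
plt.plot(x, slope * x + intercept, color='red', label='best-fit line')

plt.xlabel('meantempi')
plt.ylabel('ENTRIESn_hourly')
plt.legend()
plt.show()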
3. Gradient descent vs statsmodels OLS
Before talking about tuning the model, we start with a basic step: using gradient descent and statsmodels OLS to find a first set of parameters theta. This set of theta might not be the best, but it gives an overview of the steps used to tune the model later.
All cleaned data can be downloaded from here: Turnstile Data of New York Subway
By gradient descent
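For reference, a quick restatement of the cost function and update rule that the code below implements (here m is the number of training samples, alpha the learning rate, and X the feature matrix including the column of ones):

J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( x^{(i)}\theta - y^{(i)} \right)^2

\theta := \theta - \frac{\alpha}{m} X^{\top} \left( X\theta - y \right)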
import math
import pandas


def compute_cost(features, values, theta):
    m = len(values)
    sum_of_square_errors = np.square(np.dot(features, theta) - values).sum()
    cost = sum_of_square_errors / (2 * m)
    return cost


def gradient_descent(features, values, theta, alpha, num_iterations):
    m = len(values)
    cost_history = []
    for i in range(num_iterations):
        predict_v = np.dot(features, theta)
        theta = theta - alpha / m * np.dot((predict_v - values), features)
        cost = compute_cost(features, values, theta)
        cost_history.append(cost)
    return theta, pandas.Series(cost_history)


def predictions(dataframe):
    # Select features (try different features!) - all features used to predict ENTRIESn_hourly
    features = dataframe[['Hour', 'maxpressurei', 'maxdewpti', 'mindewpti',
                          'minpressurei', 'meandewpti', 'meanpressurei',
                          'meanwindspdi', 'mintempi', 'meantempi',
                          'maxtempi', 'precipi']]
    # Add dummy variables for the categorical UNIT column
    dummy_units = pandas.get_dummies(dataframe['UNIT'], prefix='unit')
    features = features.join(dummy_units)

    # Values - or y in the model
    values = dataframe['ENTRIESn_hourly']
    m = len(values)

    features, mu, sigma = normalize_features(features)

    # Add a column of 1s (for theta0)
    features['ones'] = np.ones(m)

    # Convert features and values to numpy arrays
    features_array = np.array(features)
    values_array = np.array(values)

    # learning rate
    alpha = 0.1
    # number of gradient descent iterations
    num_iterations = 15000

    # Initialize theta, perform gradient descent
    theta_gradient_descent = np.zeros(len(features.columns))
    theta_gradient_descent, cost_history = gradient_descent(features_array,
                                                            values_array,
                                                            theta_gradient_descent,
                                                            alpha,
                                                            num_iterations)

    predictions = np.dot(features_array, theta_gradient_descent)

    # coefficient of determination (compute_r_squared is sketched after the analysis below)
    r = math.sqrt(compute_r_squared(values_array, predictions))
    return predictions, r
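A small aside, not part of the original walkthrough: plotting cost_history (the pandas Series returned by gradient_descent above) is a quick way to check that the learning rate alpha is reasonable, since the cost should decrease steadily and flatten out:

import matplotlib.pyplot as plt

# cost_history is the pandas Series returned by gradient_descent(...) above
plt.plot(cost_history.index, cost_history.values)
plt.xlabel('iteration')
plt.ylabel('cost J(theta)')
plt.show()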
By statsmodels OLS
import pandas as pd
import statsmodels.api as sm


def predictions(df_in):
    # select the features to use
    # feature_names = ['meantempi', 'Hour']
    feature_names = ['Hour', 'maxpressurei', 'maxdewpti', 'mindewpti',
                     'minpressurei', 'meandewpti', 'meanpressurei',
                     'meanwindspdi', 'mintempi', 'meantempi',
                     'maxtempi', 'precipi']

    # the Y values
    Y = df_in['ENTRIESn_hourly']

    # build the X features: add dummy units, standardize, and add a constant
    dummy_units = pd.get_dummies(df_in['UNIT'], prefix='unit')
    X = df_in[feature_names].join(dummy_units)
    X, mu, sigma = normalize_features(X)
    # adding a constant improves the model a little bit
    X = sm.add_constant(X)

    # ordinary least squares model
    model = sm.OLS(Y, X)
    # fit the model
    results = model.fit()
    prediction = results.predict(X)

    return prediction
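The statsmodels results object also exposes the fit quality directly; inside predictions(), right after results = model.fit(), one could inspect:

print(results.rsquared)   # coefficient of determination R^2
print(results.summary())  # per-feature coefficients, standard errors and p-values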
- Analysis
- Both techniques require adding dummy variables to isolate the categorical feature (UNIT). This improves the model a lot.
- Adding a constant in statsmodels (which shifts Y by a constant value) is not meaningless, but it only improves the model a little bit.
- Both approaches use the coefficient of determination (R²) to validate the model: the closer it is to 1, the better the fit (a sketch of this metric is given below).
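The gradient descent code above calls compute_r_squared, which was not shown there. A minimal sketch, assuming the standard definition R² = 1 − SS_res / SS_tot:

import numpy as np

def compute_r_squared(values, predictions):
    """
    Coefficient of determination: R^2 = 1 - SS_res / SS_tot,
    where SS_res is the residual sum of squares and SS_tot the total sum of squares.
    """
    ss_res = np.square(values - predictions).sum()
    ss_tot = np.square(values - values.mean()).sum()
    return 1 - ss_res / ss_tot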
(to be continued)