Sunday 1 March 2015

Linear Regression - Machine Learning

Multivariate linear regression
Regression, in most contexts, is just a fancy word for interpolation. It is frequently used in estimation and prediction scenarios, and it is one of the favorite tools of a data analyst.

Let’s see how it works by looking at some examples...
Suppose you have the following data.

It is a two-dimensional plot depicting the relationship between the population of a city and the profit a chain of retail stores makes there per year.
But what if, using this data, you wanted to predict the approximate profit a given company would make in a city with a population of 37,890?

This is a very realistic problem where the decision of whether or not to open a retail outlet in that city would depend upon your prediction as an analyst. So how would you give a reasonable estimate?

Let’s see…

If there were some way of fitting a straight line through the data points, then to predict the profit for any given population you would just need to read off the 'Y' coordinate of the line at that point; that would be awesome!

Something like so…

But how do you decide which line would be the best? After all, you could draw infinitely many such lines!

This is when linear regression comes to the rescue. The whole idea behind linear regression is to draw a line that "best fits" the data you have, so that it gives reliable predictions for any new piece of data. One way of doing this is to fit the line so that the sum of the squared vertical distances between the line and your data points is minimized.


Suppose the red points depict your data; you want to draw the green line in such a way that the combined (squared) lengths of the blue lines are minimized. This approach is known as the least squares method.
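To make that precise: for a candidate line h(x) = t0 + t1*x, least squares measures how bad the fit is by summing the squared vertical distances between the line and the data. Here is a minimal MATLAB sketch of that quantity; the data values are made up purely for illustration.

####################
%sum of squared vertical distances between a candidate line and some data
x  = [2.0; 3.5; 5.1; 7.8];       %made-up populations
y  = [1.1; 2.8; 4.0; 6.9];       %made-up profits
t0 = 0.2; t1 = 0.85;             %a candidate intercept and slope
cost = sum((t0 + t1*x - y).^2)   %the quantity least squares tries to minimize
####################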

So far so good. But what about the case when more than one factor decides the overall profit of the retail chain?
Factors like shop size, number of items sold, gender ratio in the area, literacy rate (it might be a book store) and so on. In theory, there can be infinitely many such factors contributing to the sales figures.

What I have demonstrated so far can be called univariate linear regression, where the output depends on only one factor. Multivariate linear regression is the case where more than one (numerically describable) factor affects the output.

In such scenarios, you cannot just plot a hyper-dimensional graph and fit a straight line by eye, but you can definitely do the same thing mathematically and get the right results.
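In equation form, instead of fitting profit ≈ t0 + t1*(population), you fit profit ≈ t0 + t1*x1 + t2*x2 + ... + tn*xn, where each xi is one of your numeric factors. As a tiny sketch of what a fitted multivariate model looks like when you use it (every number here is made up for illustration):

####################
%hypothetical fitted parameters for a model with three features
theta    = [2.5; 0.8; -0.3; 1.1];   %[intercept; one weight per feature]
features = [1, 12.4, 0.52, 3.0];    %the leading 1 pairs with the intercept
predicted_profit = features*theta   %t0 + t1*x1 + t2*x2 + t3*x3
####################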



I am going to give the MATLAB code for this here, but you can also download it from my GitHub page.

Any improvements and contributions to the code are greatly appreciated.


So, suppose you have some data and you want to make predictions for new incoming data; here is how you can use the following MATLAB code to do your regression analysis...


The code first.
#######################################################################
%% multivariate linear regression
fprintf('welcome to multivariate linear regression');
fprintf('\n');
data=input('please enter the name of your data file in single quotes \n');

d=load(data);
%the data set must have all the beginning columns as features and the last column as output

disp(d);
s=size(d);
l=s(1,1);                      %number of data points
fprintf('number of data points = %i',s(1,1));
fprintf('\n');
fprintf('the number of features is = %i',s(1,2)-1);
X=d(:,1:s(1,2)-1);             %feature columns
X=normalize(X);                %scale each feature to zero mean and unit variance
y=d(:,s(1,2));                 %output column
%disp(X)
%disp(y)
theta=rand(s(1,2),1);          %random initial parameters (intercept plus one per feature)
temp=zeros(s(1,2),1);
fprintf('\n');
alpha=input('please define learning rate ');
n=input('define number of iterations ');
fprintf('\n');
fprintf('running linear regression...');
x=[ones(s(1,1),1),X];          %prepend a column of ones for the intercept term
%disp(x)
%disp(x*theta);
for i=1:n
    %compute the update for every parameter first, then apply them all together
    for j=1:s(1,2)
        temp(j)=theta(j)-(alpha*(1/l)*derivative(x,y,theta,j));
    end
    theta=temp;                %simultaneous update of all parameters
end
T=theta;
fprintf('\n');
disp('learned parameter vector T:');
disp(T);
##################################################################


You also need the following function. Save it as 'derivative.m' in the same directory as the previous code.


######################################
function z=derivative(x,y,theta,j)
%partial derivative of the squared-error cost with respect to theta(j):
%the sum over all data points of (prediction - actual) times the j-th feature value
z=0;
for i=1:length(y)
    z=z+(x(i,:)*theta-y(i))*x(i,j);
end

end
#####################################
This is the second function that the code needs to run. Save it as 'normalize.m'.


##########################################
function z=normalize(x)
%scale every column of x to zero mean and unit standard deviation

s=size(x);
l=s(1,1);   %number of rows (data points)
b=s(1,2);   %number of columns (features)
for i=1:b
    m=mean(x(:,i));    %column mean, computed once per feature
    sd=std(x(:,i));    %column standard deviation
    for j=1:l
        x(j,i)=(x(j,i)-m)/sd;
    end
end

z=x;
end
############################################


The algorithm uses gradient descent to optimize its parameters.

This means that while running, you will be prompted to define the learning rate and the number of iterations that you want to run the algorithm for.

Typically, the learning rate is of the order of 0.01, but it is recommended that you experiment with it. Also, data sets of different sizes may require different numbers of iterations; a reasonable starting point is about double the number of data points in your dataset.

NOTE: if your results are blowing up to infinity, that means your learning rate is too big.
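As a rough illustration of a run, here is the kind of input you might type at the prompts; the file name 'mydata.txt' is only a placeholder for whatever your data file is actually called:

####################
please enter the name of your data file in single quotes
'mydata.txt'
please define learning rate 0.01
define number of iterations 200
####################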

How to format input data?
The input data should typically be a .csv file, or a similarly formatted plain-text file, in which every column except the last represents a feature that your prediction depends on, and the last column contains the corresponding result.
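For example, a small data file with two features (say, population and shop size) and the profit as the last column might look like this; all the numbers are made up purely for illustration:

####################
12.5, 3.2, 21.7
8.1,  2.5, 13.4
20.3, 4.8, 35.0
15.0, 3.9, 26.1
####################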

How to interpret the result?

The output of the code will be a vector 'T', which is the parameter vector you take the dot product of with your new feature vector to predict the output. Keep in mind that the first element of 'T' must be multiplied by 1 (the intercept term), and the rest should be multiplied by the corresponding features of your new feature vector.
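As a minimal sketch of making a prediction with the learned vector T (the feature values and variable names here are my own, not part of the code above): because the training features were scaled by 'normalize.m', a new feature vector should be scaled with the same column means and standard deviations before multiplying.

####################
%hypothetical example of predicting with the learned parameters T
%assumes d (the loaded data) and T are still in the workspace after running the script
x_new    = [12.5, 3.2];              %raw values of the two features
mu       = mean(d(:,1:end-1));       %column means of the training features
sigma    = std(d(:,1:end-1));        %column standard deviations of the training features
x_scaled = (x_new-mu)./sigma;        %the same scaling that normalize.m applied
prediction = [1, x_scaled]*T         %the leading 1 multiplies the intercept element of T
####################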
