Sunday 1 March 2015

Logistic Regression - Machine Learning

Logistic Regression - A Classification Algorithm

Logistic regression is a technique for classifying the points of a dataset into two separate categories. It is a kind of supervised learning algorithm: you make the computer learn to differentiate between the two categories using a large set of labeled examples, after which it should be able to classify a new data point into one of the two categories.

In all its technicalities, logistic regression is a lot like linear regression in the sense that you decide the membership of a data point based on a weighted combination of several factors.
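The key difference lies in the hypothesis function. Linear regression outputs a raw weighted sum of the factors, while logistic regression passes that sum through the sigmoid function, so that the output always lies between 0 and 1 and can be read as a probability:

\[ h_{\text{linear}}(x) = \theta^T x, \qquad h_{\text{logistic}}(x) = \frac{1}{1 + e^{-\theta^T x}} \]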

Let us try to understand this using an example. Suppose you want to classify tumors, based on their size and cell density, as either malignant or benign. You plot a graph of the data you have, and you may get something like this...

[Figure: scatter plot of tumor size against cell density, with malignant tumors in red, benign tumors in blue, and a green decision boundary between them]

The above graph depicts malignant (red) and benign (blue) tumors based on tumor size and cell density. The green curve is a decision boundary that separates the two categories. Our job as analysts is to find a way of drawing the most suitable boundary curve. Also, unlike this two-dimensional figure, the factors affecting our categorization can be far more numerous, which means you may need a multi-dimensional space to represent this kind of data. Although it is not possible to draw such a space graphically, the maths can handle any number of dimensions. That is where logistic regression comes into the picture.


As an analyst, you do not need to worry, though. Below I have provided Matlab code for doing this with a dataset with an arbitrary number of dimensions.

All you need to do is format your data as a .csv or text file where all the initial columns numerically describe the factors affecting your classification decision, and the last column is the classification as a binary number, i.e. if the tumor is malignant, label it '1', and if the tumor is benign, label it '0'.
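For instance, a dataset with two features (say, tumor size and cell density; the numbers below are made up purely for illustration), saved as 'tumors.txt', could look like this:

#####################################################
2.3 0.81 1
1.1 0.35 0
3.0 0.92 1
0.9 0.28 0
#####################################################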

After that, all you need to do is execute the code.

There are two things you need to keep in mind before using this algorithm.

  • It can handle classification tasks where there are only two categories.
  • All data should be represented in numerical format except the output column, which has to be binary (0 or 1).
If you want to use this algorithm on problems that have more than two classification categories, you can still do so via the one-vs-all approach. In that case you choose one category at a time and compare it against all the others, treated together as the binary complement of that category.
Something like the sketch below...
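Here is a minimal sketch of the relabelling step in Matlab, assuming your multi-category labels are stored in a vector 'labels' and the category currently being separated is 'k' (both names are mine, used only for illustration):

#####################################################
% one-vs-all: the chosen category becomes '1', all others '0'
y=double(labels==k);
% run logistic regression on (X, y), keep the resulting parameter
% vector, then repeat with the next category as 'k'
#####################################################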

Using the code

The algorithm uses gradient descent to optimize its parameters.

This means that, while the script is running, you will be prompted to define the learning rate and the number of iterations that you want to run the algorithm for.

Typically, the learning rate is of the order of 0.01, but it is recommended that you experiment with it. Also, datasets of different sizes may require different numbers of iterations; as a rule of thumb, use about double the number of data points in your dataset.
NOTE: If your results are blowing up to infinity, your learning rate is too big.
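For reference, the parameter update that gradient descent performs in every iteration is the standard one for logistic regression, where m is the number of data points, alpha is the learning rate, h is the sigmoid hypothesis shown earlier, and x_0 = 1 by convention for the bias term:

\[ \theta_j := \theta_j - \frac{\alpha}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)} \]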

How to interpret the results?

The output of the algorithm will be a vector 'T'. Take the dot product of this vector (excluding its first term) with the new feature vector you want to classify, add the first term of 'T' to the result, and then pass the sum through the sigmoid function. Your answer will be a number between 0 and 1. Suppose you get 0.7: this means there is a 70% chance that the given data point belongs to the category labeled '1'. (Since the code normalizes the features, the new point must first be normalized with the same column means and standard deviations as the training data.)
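Here is a minimal sketch of that step, assuming 'x_new' holds the new feature vector, already normalized as described above ('x_new' is my own name, used only for illustration):

#####################################################
% probability that x_new belongs to the category labeled '1'
p=sigmoid(T(1)+T(2:end)'*x_new(:));
if p>=0.5
    fprintf('classified as 1 (probability %f)\n',p);
else
    fprintf('classified as 0 (probability %f)\n',p);
end
#####################################################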



Here is the Matlab code...
You can also download this code from my GitHub page.
#####################################################
%%logistic regression
% regularization pending
fprintf('welcome to logistic regression');
fprintf('\n');
data=input('mention the name of the data set in single quotes \n');
d=load(data);
disp(d);
s=size(d);
l=s(1,1);                 % number of data points
b=s(1,2);                 % number of columns (features + label)
fprintf('number of data points = %i',l);
fprintf('\n');
fprintf('the number of features is = %i',b-1);
X=d(:,1:b-1);             % feature columns
X=normalize(X);           % zero-mean, unit-variance features
y=d(:,b);                 % binary labels (0 or 1)
theta=rand(b,1);          % random initial parameters
temp=zeros(b,1);
fprintf('\n');
alpha=input('please define learning rate');
n=input('define number of iterations');
fprintf('\n');
fprintf('running logistic regression...');
x=[ones(l,1),X];          % prepend a column of ones for the bias term
for i=1:n
    % compute every update first, then assign, so that each theta(j)
    % is updated from the same parameter vector (simultaneous update)
    for j=1:b
        temp(j)=theta(j)-(alpha*(1/l)*regderivative(x,y,theta,j));
    end
    theta=temp;
end
T=theta;                  % final parameter vector
######################################################

You will also need the following functions.
Save the following function as 'regderivative.m' in the working directory.


#########################################################
function z=regderivative(x,y,theta,j)
% partial derivative of the logistic cost with respect to theta(j),
% summed over all data points (the 1/m factor is applied by the caller)
z=0;
for i=1:length(y)
    z=z+(sigmoid(x(i,:)*theta)-y(i))*(x(i,j));
end
end
########################################################

Save this as 'normalize.m' in the working directory.

########################################################
function z=normalize(x)
% scale every feature column to zero mean and unit variance
s=size(x);
b=s(1,2);
for i=1:b
    mu=mean(x(:,i));      % column statistics must be computed once,
    sigma=std(x(:,i));    % before the column is modified
    x(:,i)=(x(:,i)-mu)/sigma;
end
z=x;
end
############################################################

Save this function as 'sigmoid.m' in your working directory.

############################################
function y=sigmoid(x)
% logistic function; note the '+' in the denominator
y=1/(1+exp(-x));
end
#############################################
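A quick sanity check of the sigmoid function: its value at zero should be exactly 0.5, and it should approach 1 and 0 for large positive and negative inputs respectively.

#####################################################
sigmoid(0)     % should display 0.5000
sigmoid(10)    % should display a value very close to 1
sigmoid(-10)   % should display a value very close to 0
#####################################################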
