Logistic Regression

Logistic regression is the appropriate regression analysis to conduct when the dependent variable is dichotomous (binary). Like all regression analyses, logistic regression is a predictive analysis. It is used to describe data and to explain the relationship between one binary dependent variable and one or more nominal, ordinal, interval or ratio-level independent variables.

For example:

  • To predict whether an email is spam (1) or not (0).
  • To predict whether an online transaction is fraudulent (1) or not (0).
  • To predict whether a tumor is malignant (1) or not (0).

In other words, the dependent variable (output) of a logistic regression model may be described as:

$$y \in \{0, 1\}$$

Training Set

The training set is the input data in which every set of feature values x comes with a correct classification y.

$$(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \dots, (x^{(m)}, y^{(m)})$$

m – number of training set examples.

$$x = [x_0, x_1, \dots, x_n]^T \in \mathbb{R}^{n+1}$$

For convenience of notation, define:

$$x_0 = 1$$

$$y \in \{0, 1\}$$
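As a minimal sketch of this notation in Python, a training set can be stored as a list of feature rows plus a list of labels, with the constant feature x_0 = 1 prepended to every row. The helper name `add_intercept` and the sample values are hypothetical and serve only as an illustration.

```python
def add_intercept(rows):
    """Prepend the constant feature x_0 = 1 to every feature row."""
    return [[1.0] + list(row) for row in rows]


# Hypothetical training set with m = 3 examples and n = 2 features.
features = [[0.5, 1.2], [2.0, 0.3], [1.5, 1.5]]
labels = [0, 1, 1]  # every y is either 0 or 1

X = add_intercept(features)
m = len(X)         # number of training examples
n = len(X[0]) - 1  # number of features, not counting x_0
```

Storing x_0 explicitly lets the intercept parameter be treated like any other parameter when computing the product of parameters and features.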

Hypothesis (the Model)

The equation that takes features and parameters as input and predicts a value as output (e.g. predicts whether an email is spam or not based on some of its characteristics).

$$h_\theta(x) = g(\theta^T x)$$

Here g() is the sigmoid function:

$$g(z) = \frac{1}{1 + e^{-z}}$$

[Figure: plot of the sigmoid function]

Now we may write down the hypothesis as follows:

$$h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}}$$

$$\text{predict } y = 0 \text{ if } h_\theta(x) < 0.5$$

$$\text{predict } y = 1 \text{ if } h_\theta(x) \geq 0.5$$
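The hypothesis and the prediction rule can be sketched in plain Python as follows. This assumes the standard logistic form, h(x) = g(theta^T x) with g(z) = 1 / (1 + e^(-z)), and a decision threshold of 0.5; the function names and the `threshold` parameter are illustrative choices, not part of the original text.

```python
import math


def sigmoid(z):
    """Sigmoid g(z) = 1 / (1 + e^(-z)), computed in a numerically stable way."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For negative z, the equivalent form e^z / (1 + e^z) avoids overflow.
    ez = math.exp(z)
    return ez / (1.0 + ez)


def hypothesis(theta, x):
    """h_theta(x) = g(theta^T x); x is expected to include x_0 = 1."""
    return sigmoid(sum(t * v for t, v in zip(theta, x)))


def predict(theta, x, threshold=0.5):
    """Return 1 when h_theta(x) >= threshold, otherwise 0."""
    return 1 if hypothesis(theta, x) >= threshold else 0


# Hypothetical parameters: theta^T x = -1 + 2 * 0.5 = 0, which lies on the boundary.
theta_example = [-1.0, 2.0]
boundary_probability = hypothesis(theta_example, [1.0, 0.5])
```

Because g(0) = 0.5, the 0.5 threshold is equivalent to checking the sign of theta^T x.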

Cost Function

A function that shows how accurate the predictions of the hypothesis are with the current set of parameters. The lower its value, the better the predictions match the training labels.

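$$J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \mathrm{Cost}(h_\theta(x^{(i)}), y^{(i)})$$

$$\mathrm{Cost}(h_\theta(x), y) = \begin{cases} -\log(h_\theta(x)) & \text{if } y = 1 \\ -\log(1 - h_\theta(x)) & \text{if } y = 0 \end{cases}$$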

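The cost function may be simplified to the following one-liner:

$$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log(h_\theta(x^{(i)})) + (1 - y^{(i)}) \log(1 - h_\theta(x^{(i)})) \right]$$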
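A minimal Python sketch of the one-line cost, assuming it is the usual cross-entropy (log loss) averaged over the m training examples. The `eps` clamp is an assumption added here to keep `log()` away from zero; the function names and sample data are hypothetical.

```python
import math


def sigmoid(z):
    """Numerically stable sigmoid g(z) = 1 / (1 + e^(-z))."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def cost(theta, X, y, eps=1e-12):
    """Average cross-entropy cost J(theta) over all training examples.

    eps is an assumption of this sketch: it clamps h away from 0 and 1
    so that log() is always defined.
    """
    m = len(X)
    total = 0.0
    for x_i, y_i in zip(X, y):
        h = sigmoid(sum(t * v for t, v in zip(theta, x_i)))
        h = min(max(h, eps), 1.0 - eps)
        total += y_i * math.log(h) + (1 - y_i) * math.log(1.0 - h)
    return -total / m


# Hypothetical data; each row starts with x_0 = 1.
X_cost = [[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]]
y_cost = [0, 0, 1, 1]
cost_at_zero = cost([0.0, 0.0], X_cost, y_cost)  # equals log(2): every h is 0.5
cost_at_fit = cost([0.0, 3.0], X_cost, y_cost)   # smaller: this theta separates the data
```

With all parameters at zero every prediction is 0.5, so the cost is log(2) regardless of the labels, which makes a convenient sanity check.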

Batch Gradient Descent

Gradient descent is an iterative optimization algorithm for finding the minimum of the cost function described above. To find a local minimum of a function using gradient descent, one takes steps proportional to the negative of the gradient (or approximate gradient) of the function at the current point.

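We need to simultaneously update $\theta_j$ for $j = 0, 1, \dots, n$:

$$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)$$

$$\theta_j := \theta_j - \frac{\alpha}{m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) x_j^{(i)}$$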

$\alpha$ – the learning rate, the constant that defines the size of the gradient descent step

$x_j^{(i)}$ – jth feature value of the ith training example

$x^{(i)}$ – input (features) of the ith training example

$y^{(i)}$ – output of the ith training example

m – number of training examples

n – number of features

When we use the term “batch” for gradient descent, it means that each step of gradient descent uses all the training examples (as you can see from the formula above).
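The batch update can be sketched in plain Python as follows. This assumes the standard logistic regression gradient, where the partial derivative for parameter j is the average of (h(x_i) - y_i) * x_ij over all examples. The function names, the default `alpha` and `iterations` values, and the demo data are hypothetical.

```python
import math


def sigmoid(z):
    """Numerically stable sigmoid g(z) = 1 / (1 + e^(-z))."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def hypothesis(theta, x):
    """h_theta(x) = g(theta^T x), where x already contains x_0 = 1."""
    return sigmoid(sum(t * v for t, v in zip(theta, x)))


def gradient_step(theta, X, y, alpha):
    """One batch step: every theta_j is updated from the same old theta."""
    m = len(X)
    # Prediction errors h_theta(x_i) - y_i over the whole training set.
    errors = [hypothesis(theta, x_i) - y_i for x_i, y_i in zip(X, y)]
    # All partial derivatives are computed before any parameter changes,
    # which is what makes the update simultaneous.
    gradients = [
        sum(e * x_i[j] for e, x_i in zip(errors, X)) / m
        for j in range(len(theta))
    ]
    return [t - alpha * g for t, g in zip(theta, gradients)]


def batch_gradient_descent(X, y, alpha=0.1, iterations=1000):
    """Run a fixed number of batch steps starting from theta = 0."""
    theta = [0.0] * len(X[0])
    for _ in range(iterations):
        theta = gradient_step(theta, X, y, alpha)
    return theta


# Hypothetical one-feature data set; each row starts with x_0 = 1.
X_demo = [[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]]
y_demo = [0, 0, 1, 1]
theta_demo = batch_gradient_descent(X_demo, y_demo)
```

A fixed iteration count keeps the sketch short; a practical implementation would more likely stop when the cost stops decreasing.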

Regularization

Overfitting Problem

If we have too many features, the learned hypothesis may fit the training set very well but fail to generalize to new examples:

[Figure: overfitting]

Solution to Overfitting

Here are a couple of options that may be considered:

  • Reduce the number of features
    • Manually select which features to keep
    • Model selection algorithm
  • Regularization
    • Keep all the features, but reduce magnitude/values of model parameters (thetas).
    • Works well when we have a lot of features, each of which contributes a bit to predicting y.

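Regularization works by adding a penalty term, scaled by a regularization parameter, to the cost function:

$$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log(h_\theta(x^{(i)})) + (1 - y^{(i)}) \log(1 - h_\theta(x^{(i)})) \right] + \frac{\lambda}{2m} \sum_{j=1}^{n} \theta_j^2$$

$\lambda$ – regularization parameter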

Note that you should not regularize the parameter $\theta_0$.

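In this case the gradient descent formula will look like the following:

$$\theta_0 := \theta_0 - \frac{\alpha}{m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) x_0^{(i)}$$

$$\theta_j := \theta_j \left(1 - \alpha \frac{\lambda}{m}\right) - \frac{\alpha}{m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) x_j^{(i)}, \quad j = 1, 2, \dots, n$$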
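A minimal Python sketch of a regularized step follows. It assumes the common L2 penalty lambda / (2m) * sum of theta_j squared for j >= 1, so theta_0 is left out of the penalty as noted above. The names `regularization_penalty`, `regularized_gradient_step` and `lam`, along with the demo data, are hypothetical.

```python
import math


def sigmoid(z):
    """Numerically stable sigmoid g(z) = 1 / (1 + e^(-z))."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def regularization_penalty(theta, lam, m):
    """Assumed L2 penalty lam / (2m) * sum(theta_j^2), skipping theta_0."""
    return lam / (2 * m) * sum(t * t for t in theta[1:])


def regularized_gradient_step(theta, X, y, alpha, lam):
    """One batch gradient descent step with L2 regularization."""
    m = len(X)
    errors = [
        sigmoid(sum(t * v for t, v in zip(theta, x_i))) - y_i
        for x_i, y_i in zip(X, y)
    ]
    new_theta = []
    for j, t in enumerate(theta):
        gradient = sum(e * x_i[j] for e, x_i in zip(errors, X)) / m
        if j == 0:
            # theta_0 is not regularized.
            new_theta.append(t - alpha * gradient)
        else:
            # The factor (1 - alpha * lam / m) shrinks theta_j on every step.
            new_theta.append(t * (1 - alpha * lam / m) - alpha * gradient)
    return new_theta


# Hypothetical data; each row starts with x_0 = 1.
X_reg = [[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]]
y_reg = [0, 0, 1, 1]
plain_step = regularized_gradient_step([1.0, 1.0], X_reg, y_reg, 0.1, 0.0)
shrunk_step = regularized_gradient_step([1.0, 1.0], X_reg, y_reg, 0.1, 5.0)
```

With `lam` set to 0 the step reduces to the unregularized update, so the same function covers both cases.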

From: