Terminology from week 1 (single feature linear regression):
- m is the number of training examples.
- n is the number of features.
- y is the output (target) value from the training set.
- x is the vector of input features in the training set.
- x(i) is the vector of input features of the ith training example.
- hθ(x) = θ0 + θ1x1 + ... + θnxn is the hypothesis function, whose parameters are adjusted to fit the training set.
- θj are the parameters of the hypothesis, one per feature.
- J is the cost function. A common choice is the mean squared error.
- α is the learning rate.
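To make the terminology concrete, here is a minimal sketch of the hypothesis and a squared error cost in plain Python. The function names, the tiny data set and the 1/(2m) scaling of the cost are my own assumptions for illustration, not something fixed by the notes above.

```python
def hypothesis(theta, x):
    """h_theta(x) = theta_0 + theta_1*x_1 + ... + theta_n*x_n.

    theta has n + 1 entries; x has n entries (no constant feature).
    """
    return theta[0] + sum(t * xj for t, xj in zip(theta[1:], x))


def cost(theta, X, y):
    """Squared error cost J(theta) = 1/(2m) * sum((h(x_i) - y_i)^2).

    The extra factor 1/2 is a convention assumed here; it only simplifies
    the derivative and does not change where the minimum is.
    """
    m = len(X)
    return sum((hypothesis(theta, xi) - yi) ** 2 for xi, yi in zip(X, y)) / (2 * m)


# Made-up training set: m = 3 examples, n = 1 feature, generated by y = 1 + 2x.
X = [[1.0], [2.0], [3.0]]
y = [3.0, 5.0, 7.0]

perfect = cost([1.0, 2.0], X, y)  # theta that reproduces the data exactly
off = cost([0.0, 2.0], X, y)      # intercept is off by 1 for every example
```

With these made-up numbers the first cost is zero, and the second is larger because every prediction misses by 1.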
Multiple features:
It is common to add a constant feature x0 = 1 to the model, so that θ0 acts as the intercept term and every θj has a matching xj.
Using vector notation, hθ(x) = θ^T x, where θ and x are both vectors of length n + 1.
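As a rough illustration, the vector form is just a dot product once the constant feature x0 = 1 has been prepended. The function name `predict` and the house-price numbers below are made up for this sketch.

```python
def predict(theta, features):
    """Compute h_theta(x) = theta^T x after prepending the constant feature x0 = 1."""
    x = [1.0] + list(features)
    if len(x) != len(theta):
        raise ValueError("theta needs one entry per feature plus the intercept")
    return sum(t * xj for t, xj in zip(theta, x))


theta = [50.0, 0.1, 20.0]      # made-up values: intercept, weight per m^2, weight per room
house = [120.0, 3.0]           # made-up example: 120 m^2, 3 rooms
price = predict(theta, house)  # 50*1 + 0.1*120 + 20*3
```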
Gradient Descent Algorithm
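Repeat until convergence, for each θj:
Subtract α times the partial derivative of the cost function with respect to θj.
For the squared error cost, that means subtracting α times the average of (hypothesis minus actual observation) multiplied by that example's value of feature j.
All θj are updated simultaneously, using the same hypothesis values.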
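The update rule can be sketched in plain Python as follows. This is only an illustration under my own assumptions: the function name, the fixed number of iterations (instead of a convergence check), the starting point θ = 0 and the tiny data set are all made up.

```python
def gradient_descent(X, y, alpha, iterations):
    """Batch gradient descent for linear regression.

    X is a list of feature rows WITHOUT the constant feature; x0 = 1 is added here.
    """
    rows = [[1.0] + list(row) for row in X]
    m = len(rows)
    theta = [0.0] * len(rows[0])
    for _ in range(iterations):
        # Hypothesis minus actual observation, for every training example.
        errors = [sum(t * xj for t, xj in zip(theta, row)) - yi
                  for row, yi in zip(rows, y)]
        # Partial derivative of J with respect to each theta_j:
        # the average of error * x_j over the training set.
        gradient = [sum(e * row[j] for e, row in zip(errors, rows)) / m
                    for j in range(len(theta))]
        # Simultaneous update: every theta_j uses the errors computed above.
        theta = [t - alpha * g for t, g in zip(theta, gradient)]
    return theta


# Made-up training set generated by y = 1 + 2x.
X = [[1.0], [2.0], [3.0]]
y = [3.0, 5.0, 7.0]
theta = gradient_descent(X, y, alpha=0.1, iterations=5000)
```

On this data set θ should end up close to [1, 2]. A learning rate that is too large makes the updates overshoot and diverge, so α = 0.1 is a choice that happens to work here, not a general recommendation.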
Gradient Descent - Feature Scaling
The gradient descent algorithm can converge faster if the features are scaled to roughly the same interval.
Feature scaling divides each input variable by its range (maximum minus minimum). The scaled variable then spans an interval of width 1, although the values themselves can still be larger than 1.
Mean normalization subtracts the mean of the variable from each value, so the new mean is 0. If you combine the two methods, you get:
xi := (xi - mean(x))/range(x)
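A small sketch of the combined formula for a single feature column. The function name and the sample values are made up, and mapping a constant column to zeros is my own choice for the case where the range is 0.

```python
def scale_feature(values):
    """Return (x - mean) / range for every value of one feature column."""
    mean = sum(values) / len(values)
    value_range = max(values) - min(values)
    if value_range == 0:
        # A constant column has no spread to scale; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mean) / value_range for v in values]


sizes = [100.0, 150.0, 200.0]  # made-up values: mean 150, range 100
scaled = scale_feature(sizes)  # -> [-0.5, 0.0, 0.5]
```

The same mean and range have to be reused when scaling new inputs at prediction time, otherwise the learned θ no longer matches the features.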
Normal Equation
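The normal equation minimizes J analytically, without iterations: θ = (X^T X)^-1 X^T y, where X holds one training example per row (including x0 = 1) and y is the vector of outputs.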
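For illustration, the normal equation θ = (X^T X)^-1 X^T y can be evaluated without any library by solving the linear system (X^T X) θ = X^T y. The sketch below uses Gauss-Jordan elimination instead of an explicit inverse, which is my own substitution; the function name and the data set are made up as well.

```python
def normal_equation(X, y):
    """Solve (X^T X) theta = X^T y, with x0 = 1 prepended to each row of X."""
    rows = [[1.0] + list(row) for row in X]
    n = len(rows[0])
    # Build the augmented matrix [X^T X | X^T y].
    aug = [[sum(r[i] * r[j] for r in rows) for j in range(n)]
           + [sum(r[i] * yi for r, yi in zip(rows, y))]
           for i in range(n)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("X^T X is singular (redundant features?)")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pivot_value = aug[col][col]
        aug[col] = [v / pivot_value for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    # The last column now holds the solution.
    return [aug[i][n] for i in range(n)]


# Made-up training set generated by y = 1 + 2x.
X = [[1.0], [2.0], [3.0]]
y = [3.0, 5.0, 7.0]
theta = normal_equation(X, y)
```

On this data set the result should be close to [1, 2], with no learning rate and no iterations.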
Gradient Descent or Normal Equation
GD: Need to choose a learning rate. NE: No need to choose a learning rate.
GD: Needs many iterations. NE: Needs no iterations.
GD: O(kn^2). NE: O(n^3), because it has to invert the n x n matrix X^T X. NE is heavy for large feature sets!
The next step is a programming assignment where I'll apply linear regression to a simple data set. I won't publish that on my blog, however.