What is the cost function in linear regression?

Linear Regression. Christian Herta. October. Topics: the problem, the cost function, the gradient descent method.

Transcript

1 Linear Regression. Christian Herta. October.

2 Learning Outcomes. Linear regression; concepts of machine learning: learning by means of a training set, the cost function, the gradient descent method.

3 Outline: 1 Problem, 2 Cost Function, 3 Gradient Descent Method.

4 Linear regression. Supervised learning: m observations $\{x^{(i)}\}$ with target values $\{y^{(i)}\}$. Goal: predict a value y for a new value of x. Linear model. What does the straight-line equation look like?

5 Linear regression. Supervised learning: m observations $\{x^{(i)}\}$ with target values $\{y^{(i)}\}$. Goal: predict a value y for a new value of x. Linear model (two parameters): $h_\Theta(x) = \Theta_0 + \Theta_1 x$.
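
The hypothesis from this slide is straightforward to express in code. A minimal sketch (not from the original slides; the names theta0 and theta1 are chosen here for illustration):

```python
def hypothesis(theta0, theta1, x):
    """Linear hypothesis h_Theta(x) = Theta_0 + Theta_1 * x."""
    return theta0 + theta1 * x

# With Theta_0 = 1.0 and Theta_1 = 2.0, the prediction for x = 3.0 is 7.0.
print(hypothesis(1.0, 2.0, 3.0))  # 7.0
```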

6 Linear regression idea: find a straight line $h_\Theta(x)$ that is as close as possible to the data points.

7 Training set notation: m: number of training examples; x: input variable; y: output variable; (x, y): a training example; $(x^{(i)}, y^{(i)})$: the i-th training example. Sample data set Hg-PCV: hemoglobin level in g/dl (x) and packed cell volume (y).

8 Overview: training procedure. Model $h_\Theta(x)$; the model parameters Θ are determined by learning from the data (training set). The function $h_\Theta$ is called the hypothesis.

9 One-Variable Linear Regression (Univariate Linear Regression). Why the name? One variable: x. Hypothesis: $h_\Theta(x) = \Theta_0 + \Theta_1 x$. The hypothesis is linear with respect to the variable x and linear with respect to the adjustable parameters $\Theta_0, \Theta_1$. Prediction of a floating-point number using the hypothesis: regression.

10 Outline: 1 Problem, 2 Cost Function, 3 Gradient Descent Method.

11 Cost function. Starting point: hypothesis $h_\Theta(x) = \Theta_0 + \Theta_1 x$ and training set D (pairs (x, y)). Goal: determine the model parameters $\Theta = \{\Theta_0, \Theta_1\}$ by learning from the data (training set). (Squared-error) cost function: $J_D(\Theta) = \frac{1}{2m} \sum_{i=1}^{m} \left(h_\Theta(x^{(i)}) - y^{(i)}\right)^2$. Goal: minimize the cost function, $\min_\Theta J(\Theta)$.
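
As a sketch of the squared-error cost function defined above (assuming the training set is given as NumPy arrays x and y; the function name is illustrative):

```python
import numpy as np

def cost(theta0, theta1, x, y):
    """Squared-error cost J(Theta) = 1/(2m) * sum_i (h_Theta(x^(i)) - y^(i))^2."""
    m = len(x)
    predictions = theta0 + theta1 * x          # h_Theta(x^(i)) for all i at once
    return np.sum((predictions - y) ** 2) / (2 * m)
```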

12 Cost function. Note: the cost function J(Θ) is a function of Θ; the hypothesis $h_\Theta(x)$ is a function of x with fixed parameters Θ. Both functions are explained on the board using the simple example $h_{\Theta_1}(x) = \Theta_1 x$ with three training examples, for which a hypothesis with minimal cost $J(\Theta_1) = 0$ can be found (only in this example).

13 Example: cost function and hypothesis. Data generation: $y(x) = x + \mathcal{N}(\mu = 0, \sigma = 2.5)$ ($\mathcal{N}$: normal distribution). Hypothesis: $h(x) = \Theta_1 x$.
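
The synthetic data set from this slide could be generated as follows. A sketch: the sample size and the input range are assumptions, not stated on the slide.

```python
import numpy as np

rng = np.random.default_rng(0)                   # fixed seed for reproducibility
m = 50                                           # number of examples (assumed)
x = rng.uniform(0.0, 10.0, size=m)               # input range is an assumption
y = x + rng.normal(loc=0.0, scale=2.5, size=m)   # y(x) = x + N(mu=0, sigma=2.5)
```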

14 Problem with two parameters. Hypothesis: $h_\Theta(x) = \Theta_0 + \Theta_1 x$; two parameters: $\Theta_0, \Theta_1$. Cost function: $J(\Theta_0, \Theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left(h_\Theta(x^{(i)}) - y^{(i)}\right)^2$. Representation of $J(\Theta_0, \Theta_1)$ in three dimensions: $\Theta_0$, $\Theta_1$, J.

15 Contour Plot

16 Data Space and Parameter Space

17 Cost function: overview. The cost is a function of the parameters. The aim is to minimize the cost in order to find good parameters. The concept of a cost function also applies to other types of model functions, such as neural networks and k-means clustering.

18 Outline: 1 Problem, 2 Cost Function, 3 Gradient Descent Method.

19 Problem. Hypothesis: $h_\Theta(x) = \Theta_0 + \Theta_1 x$. Parameters: $\Theta_0, \Theta_1$. Cost function: $J(\Theta_0, \Theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left(h_\Theta(x^{(i)}) - y^{(i)}\right)^2$. Goal: $\min_\Theta J(\Theta)$.

20 Gradient descent. Goal: minimize the cost function, $\min_\Theta J(\Theta)$. 1. Start with initial values for Θ (for univariate linear regression: $\Theta = \{\Theta_0, \Theta_1\}$). 2. Change the values of Θ so that J(Θ) becomes smaller. 3. Repeat step 2 until a minimum is reached.

21 Gradient descent method. Goal: minimize the cost function, $\min_\Theta J(\Theta)$. 1. Start with initial values for $\Theta_0, \Theta_1$. 2. Determine the gradient (partial derivatives) to find new values $\Theta_0, \Theta_1$ in the vicinity of the old values, using the update rule $\Theta_j^{\text{new}} \leftarrow \Theta_j^{\text{old}} - \alpha \frac{\partial}{\partial \Theta_j} J(\Theta^{\text{old}})$, with α: the learning rate. 3. Go to step 2 until a stopping condition is met, e.g. only a marginal change in the cost.

22 Simultaneous update of all parameters. Implementation note: all parameters must be updated simultaneously: temp0 ← $\Theta_0 - \alpha \frac{\partial}{\partial \Theta_0} J(\Theta_0, \Theta_1)$; temp1 ← $\Theta_1 - \alpha \frac{\partial}{\partial \Theta_1} J(\Theta_0, \Theta_1)$; $\Theta_0$ ← temp0; $\Theta_1$ ← temp1.
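
In code, the simultaneous update corresponds to evaluating both partial derivatives at the old parameter values before overwriting either parameter. A sketch, assuming a helper cost_gradient that returns both partial derivatives (one possible version of it is sketched after slide 25 below):

```python
def gradient_step(theta0, theta1, x, y, alpha):
    """One gradient descent step with a simultaneous update of both parameters."""
    # Both partial derivatives are evaluated at the OLD parameter values ...
    grad0, grad1 = cost_gradient(theta0, theta1, x, y)
    temp0 = theta0 - alpha * grad0
    temp1 = theta1 - alpha * grad1
    # ... and only then are the parameters replaced.
    return temp0, temp1
```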

23 Calculating the Gradient: Math Exercise

24 Gradient descent method for linear regression: $\Theta_0$. Update: $\Theta_0 \leftarrow \Theta_0 - \alpha \frac{\partial}{\partial \Theta_0} J(\Theta)$, with $\frac{\partial}{\partial \Theta_0} J(\Theta) = \frac{\partial}{\partial \Theta_0} \frac{1}{2m} \sum_{i=1}^{m} \left(h_\Theta(x^{(i)}) - y^{(i)}\right)^2 = \frac{\partial}{\partial \Theta_0} \frac{1}{2m} \sum_{i=1}^{m} \left(\Theta_0 + \Theta_1 x^{(i)} - y^{(i)}\right)^2 = \frac{1}{m} \sum_{i=1}^{m} \left(\Theta_0 + \Theta_1 x^{(i)} - y^{(i)}\right)$.

25 Gradient descent method for linear regression: $\Theta_1$. Update: $\Theta_1 \leftarrow \Theta_1 - \alpha \frac{\partial}{\partial \Theta_1} J(\Theta)$, with $\frac{\partial}{\partial \Theta_1} J(\Theta) = \frac{\partial}{\partial \Theta_1} \frac{1}{2m} \sum_{i=1}^{m} \left(h_\Theta(x^{(i)}) - y^{(i)}\right)^2 = \frac{\partial}{\partial \Theta_1} \frac{1}{2m} \sum_{i=1}^{m} \left(\Theta_0 + \Theta_1 x^{(i)} - y^{(i)}\right)^2 = \frac{1}{m} \sum_{i=1}^{m} \left(\Theta_0 + \Theta_1 x^{(i)} - y^{(i)}\right) x^{(i)}$.
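
Putting the two partial derivatives together, gradient descent for univariate linear regression might look like the following sketch (vectorized with NumPy; the values of alpha and the iteration count are illustrative choices, not from the slides):

```python
import numpy as np

def cost_gradient(theta0, theta1, x, y):
    """Partial derivatives of J with respect to Theta_0 and Theta_1 (slides 24/25)."""
    m = len(x)
    error = theta0 + theta1 * x - y        # (h_Theta(x^(i)) - y^(i)) for all i
    grad0 = np.sum(error) / m              # dJ/dTheta_0
    grad1 = np.sum(error * x) / m          # dJ/dTheta_1
    return grad0, grad1

def gradient_descent(x, y, alpha=0.01, iterations=1000):
    theta0, theta1 = 1.0, 1.0              # starting value as on slide 27
    for _ in range(iterations):
        grad0, grad1 = cost_gradient(theta0, theta1, x, y)
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
    return theta0, theta1
```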

26 Step size. The step size depends on two factors: the size of the gradient $\frac{\partial}{\partial \Theta_i} J(\Theta)$ and the learning rate $\alpha > 0$ (a hyperparameter). α must be chosen correctly (more on this later).

27 Sample data set: Hg-PCV. Approximation of the straight line over the iterations; starting value $\Theta = (1., 1.)$.

28 Sample data set: Hg-PCV. Approximations of the straight line over the iterations; starting value $\Theta = (1., 1.)$.

29 Why is learning so slow? The (negative) gradient does not (in general) point directly toward the minimum! The result is a zig-zag movement in parameter space, or a very small α is required.

30 Cost function with rescaled x-values. Solution: feature scaling (explained later in the course). Note: the gradient now points directly toward the minimum! Sample data set: Hg-PCV.

31 Cost function with rescaled x-values. Explanation using a simple example: the x-values of the green data are multiplied by a factor of 2.
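
Feature scaling is treated later in the course, but a common sketch is to standardize x before running gradient descent. The mean/std standardization below is one of several options and is an assumption, not the slides' prescribed method; it reuses the array x from the data-generation sketch above.

```python
import numpy as np

def standardize(x):
    """Rescale x to zero mean and unit standard deviation."""
    return (x - np.mean(x)) / np.std(x)

# Gradient descent on the rescaled inputs makes the contours of J more
# circular, so the negative gradient points more directly at the minimum.
x_scaled = standardize(x)
```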

32 Batch, mini-batch, and online learning. Batch learning: use all training data for one optimization step. Mini-batch learning: use a (small) part of the training data for one optimization step. Online learning: use only one training example per step, typically randomly selected (stochastic gradient descent).
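
A sketch contrasting the three variants by how many examples feed each gradient step; the batch_size handling is illustrative, and cost_gradient is the helper sketched after slide 25:

```python
import numpy as np

def training_step(theta0, theta1, x, y, alpha, batch_size=None):
    """One optimization step: batch_size=None -> batch learning,
    batch_size=1 -> online learning (SGD), otherwise mini-batch learning."""
    m = len(x)
    if batch_size is None:
        idx = np.arange(m)                                          # all training data
    else:
        idx = np.random.choice(m, size=batch_size, replace=False)   # random subset
    grad0, grad1 = cost_gradient(theta0, theta1, x[idx], y[idx])
    return theta0 - alpha * grad0, theta1 - alpha * grad1
```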

33 Bibliography. Andrew Ng: Machine Learning. OpenClassroom, Stanford University, 2013. Further reading: C. Bishop: Pattern Recognition and Machine Learning, Springer Verlag.