Overview
PyPi module | N/A
git repository | https://bitbucket.org/arrizza-public/ai-linear-regression
git command | git clone git@bitbucket.org:arrizza-public/ai-linear-regression.git
Verification Report | https://arrizza.com/web-ver/ai-linear-regression-report.html
Version Info | installation: see https://arrizza.com/setup-common
Summary
This project shows a simple linear regression calculation and then compares it to scipy linear regression.
This site contains the mathematical description for linear regression: https://www.ncl.ac.uk/webtemplate/ask-assets/external/maths-resources/statistics/regression-and-correlation/simple-linear-regression.html
The doc for scipy's linregress function is here: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.linregress.html#scipy.stats.linregress
Additional python based regression modules are here: https://realpython.com/linear-regression-in-python/
How it works
The code in app.py builds two lists, one for the x data and one for the y data. Both are generated from a linear equation we already know, so the constants calculated by the linear regression can be compared against the original ones. A match is evidence that the calculations are correct.
Right now, the x values run from -5.0 to 5.0 in steps of 1.0, e.g. -5.0, -4.0, -3.0 ... 4.0, 5.0.
self._gen_x_values(data, -5.0, 5.0, increment=1.0) # in app.py, function _gen_values()
The y values are calculated from the x values using the slope and intercept:
data.slope = 3.0
data.intercept = 5.0
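The data generation described above can be sketched as follows. This is a minimal illustration, not the code from app.py: the function names gen_x_values and gen_y_values are hypothetical, and only the range, increment, slope and intercept come from the text.

```python
def gen_x_values(start, stop, increment):
    """Return evenly spaced x values from start to stop inclusive."""
    # Use an integer step count so floating point error does not accumulate.
    count = int(round((stop - start) / increment)) + 1
    return [start + i * increment for i in range(count)]


def gen_y_values(xs, slope, intercept):
    """Return y = slope * x + intercept for each x."""
    return [slope * x + intercept for x in xs]


xs = gen_x_values(-5.0, 5.0, 1.0)
ys = gen_y_values(xs, slope=3.0, intercept=5.0)
print(len(xs), xs[0], ys[0])  # 11 -5.0 -10.0
```

Computing each x from an index, rather than repeatedly adding the increment, keeps the last value at exactly the requested boundary even for small increments such as 0.01.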
To run it
You can run the simple technique, the scipy technique, or both to compare their resulting data. When there is no noise in the data, the calculated slope and intercept should match the original slope and intercept (i.e. 3 and 5).
./doit --tech simple
./doit --tech scipy
./doit --tech all
To see what happens when there is noise, use the --noise switch:
./doit --tech simple --noise 0.1 # adds random +/-0.1 noise to each y value
Note: default noise is 0.0 i.e. no noise
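As a rough illustration of what a noise switch like this might do, the sketch below adds a random offset to each y value. The function name add_noise is hypothetical, and the uniform distribution over [-noise, +noise] is an assumption based on the "+/-" wording above; the actual distribution used in app.py is not shown here.

```python
import random


def add_noise(ys, noise, rng=random):
    """Return a copy of ys with uniform noise in [-noise, +noise] added."""
    # Assumption: noise is uniform and centred on zero.
    return [y + rng.uniform(-noise, noise) for y in ys]


clean = [3.0 * x + 5.0 for x in range(-5, 6)]
# A seeded generator makes the noisy data repeatable between runs.
noisy = add_noise(clean, 0.1, random.Random(42))
```

With noise set to 0.0 the returned values equal the input, which matches the default "no noise" behaviour.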
Simple linear regression
Run the simple linear regression
./doit --tech simple --noise 0.0
==== linear regression: simple
noise: 0.00
-> generated 11 data.x values
-> generated 11 data.y values
-> first 15 data values:
x,y[ 0]: -5.00, -10.00
x,y[ 1]: -4.00, -7.00
x,y[ 2]: -3.00, -4.00
x,y[ 3]: -2.00, -1.00
x,y[ 4]: -1.00, 2.00
x,y[ 5]: 0.00, 5.00
x,y[ 6]: 1.00, 8.00
x,y[ 7]: 2.00, 11.00
x,y[ 8]: 3.00, 14.00
x,y[ 9]: 4.00, 17.00
x,y[10]: 5.00, 20.00
The x,y data is reported, up to a maximum of 15 data points. Each row follows from the original equation, for example:
x,y[ 0]: -5.00, -10.00
# when x == -5.00, then:
# y = -10.00
# y = slope*x + intercept
# y = 3.0 * x + 5.0
The results show that the regression slope/intercept exactly match the original slope/intercept i.e. 0% error in both.
-> simple: regression results:
regression : y = 3.00 * x + 5.00
original : y = 3.00 * x + 5.00
err slope : 0.00%
err intercept : 0.00%
See the class SimpleLinearRegression for the mathematical steps taken to calculate the linear regression.
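For reference, here is a minimal sketch of the standard least-squares formulas for a straight line fit. It is not a copy of the SimpleLinearRegression class, and the function name simple_linear_regression is hypothetical; it only shows the textbook calculation (slope = Sxy / Sxx, intercept = mean_y - slope * mean_x).

```python
def simple_linear_regression(xs, ys):
    """Return (slope, intercept) of the least-squares line through the data."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sxx: spread of x around its mean; Sxy: co-variation of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0:
        raise ValueError("all x values are identical")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


xs = [float(x) for x in range(-5, 6)]
ys = [3.0 * x + 5.0 for x in xs]
slope, intercept = simple_linear_regression(xs, ys)
print(f"y = {slope:.2f} * x + {intercept:.2f}")  # y = 3.00 * x + 5.00
```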
Add Noise
Using --noise shows how the calculation still works okay even when there is noise in the data.
./doit --tech simple --noise 0.1
<skip>
-> simple: regression results:
regression : y = 3.00 * x + 5.04
original : y = 3.00 * x + 5.00
err slope : 0.12%
err intercept : 0.82%
The calculated slope differs slightly from the original 3.0 (0.12% error), but it is rounded to two decimal places for display and so it still shows as "3.00".
Now add more noise:
./doit --tech simple --noise 1.5
<skip>
-> simple: regression results:
regression : y = 2.94 * x + 5.80
original : y = 3.00 * x + 5.00
err slope : -1.96%
err intercept : 15.90%
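The error percentages in these reports look like plain relative errors against the original constants. The sketch below shows that calculation; the function name percent_error is hypothetical, and the formula is inferred from the numbers above rather than taken from the project code.

```python
def percent_error(calculated, original):
    """Relative error of calculated versus original, in percent."""
    # The original value must be non-zero, which holds for slope 3.0
    # and intercept 5.0 used in this project.
    return (calculated - original) / original * 100.0


print(f"{percent_error(2.94, 3.0):.2f}%")  # -2.00%
```

The report above shows -1.96% rather than -2.00%, presumably because it is calculated from the unrounded slope and only the displayed slope is rounded to 2.94.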
scipy linear regression
Run the scipy linear regression.
./doit --tech scipy --noise 0.0
<skip>
-> scipy: regression results:
regression : y = 3.00 * x + 5.00
original : y = 3.00 * x + 5.00
err slope : 0.00%
err intercept : 0.00%
rvalue : 1.0
pvalue : 5.8534851285390365e-90
std_err : 0.0
The initial part of the run is identical to "simple", and the results for slope/intercept are identical too. The rvalue, pvalue and std_err values are described in the scipy doc (see the link above).
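A minimal sketch of calling scipy directly is shown below. It assumes scipy is installed; the wrapper name fit_with_scipy and the dictionary keys are hypothetical and are not how app.py is organized. Note that the value reported as std_err above is exposed by scipy as the stderr attribute of the result.

```python
def fit_with_scipy(xs, ys):
    """Fit a line with scipy.stats.linregress and return the main results."""
    # Imported inside the function so this sketch loads even without scipy.
    from scipy import stats

    result = stats.linregress(xs, ys)
    return {
        "slope": result.slope,
        "intercept": result.intercept,
        "rvalue": result.rvalue,
        "pvalue": result.pvalue,
        "std_err": result.stderr,  # standard error of the slope
    }
```

Calling fit_with_scipy with the noise-free x and y lists from this project should return a slope of 3.0 and an intercept of 5.0 to within floating point precision.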
Run both
You can run both on the same data to compare and make sure they match:
./doit --tech all --noise 0.0
<skip>
-> simple: regression results:
regression : y = 3.00 * x + 5.00
original : y = 3.00 * x + 5.00
err slope : 0.00%
err intercept : 0.00%
-> scipy: regression results:
regression : y = 3.00 * x + 5.00
original : y = 3.00 * x + 5.00
err slope : 0.00%
err intercept : 0.00%
rvalue : 1.0
pvalue : 5.8534851285390365e-90
std_err : 0.0
And now with noise:
./doit --tech all --noise 1.5
<skip>
-> simple: regression results:
regression : y = 2.95 * x + 5.67
original : y = 3.00 * x + 5.00
err slope : -1.75%
err intercept : 13.47%
-> scipy: regression results:
regression : y = 2.95 * x + 5.67
original : y = 3.00 * x + 5.00
err slope : -1.75%
err intercept : 13.47%
rvalue : 0.9987438408820251
pvalue : 5.156252366187247e-13
std_err : 0.049293050131861714
-----------
Note that the error percentages for slope and intercept are the same for both methods.
Changing the number of data points
To change the number of data points or the boundaries of the data, modify this line in app.py. For example, changing the increment from 1.0 to 0.01 gives 1001 points:
self._gen_x_values(data, -5.0, 5.0, increment=0.01)
Running with no noise gives results similar to those above:
./doit --tech all --noise 0.0
<skip>
-> generated 1001 data.x values
-> generated 1001 data.y values
<skip>
-> simple: regression results:
regression : y = 3.00 * x + 5.00
original : y = 3.00 * x + 5.00
err slope : 0.00%
err intercept : 0.00%
-> scipy: regression results:
regression : y = 3.00 * x + 5.00
original : y = 3.00 * x + 5.00
err slope : 0.00%
err intercept : -0.00%
rvalue : 1.0
pvalue : 0.0
std_err : 0.0
Running with noise on the larger data set gives a smaller std_err than the 11-point run with the same noise, but everything else is roughly the same:
./doit --tech all --noise 1.5
-> simple: regression results:
regression : y = 3.00 * x + 5.75
original : y = 3.00 * x + 5.00
err slope : -0.03%
err intercept : 14.94%
-> scipy: regression results:
regression : y = 3.00 * x + 5.75
original : y = 3.00 * x + 5.00
err slope : -0.03%
err intercept : 14.94%
rvalue : 0.998766741274143
pvalue : 0.0
std_err : 0.004716712419855965
std_err dropped from 0.049293050131861714 to 0.004716712419855965, roughly a tenfold reduction.
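The drop in std_err with more data points can be explored with a short sketch. It uses the usual formula for the standard error of the slope, sqrt(SSR / (n - 2) / Sxx), where SSR is the sum of squared residuals. The function names slope_std_err and noisy_line are hypothetical, and the zero-centred uniform noise is an assumption, so the numbers will not match the report above exactly.

```python
import math
import random


def slope_std_err(xs, ys):
    """Standard error of the least-squares slope (needs at least 3 points)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Sum of squared residuals around the fitted line.
    ssr = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    return math.sqrt(ssr / (n - 2) / sxx)


def noisy_line(count, increment, noise, rng):
    """Generate y = 3x + 5 from x = -5.0 upward, with assumed uniform noise."""
    xs = [-5.0 + i * increment for i in range(count)]
    ys = [3.0 * x + 5.0 + rng.uniform(-noise, noise) for x in xs]
    return xs, ys


rng = random.Random(1)
coarse = slope_std_err(*noisy_line(11, 1.0, 1.5, rng))
fine = slope_std_err(*noisy_line(1001, 0.01, 1.5, rng))
print(f"11 points: {coarse:.4f}, 1001 points: {fine:.4f}")
```

More points over the same x range increase Sxx, which is why the standard error shrinks even though the noise level is unchanged. One thing this sketch does not reproduce: in the 1001-point output above the intercept stays near 5.75 while the slope error is almost zero, which is consistent with noise that averages about +0.75 rather than zero.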