Linear Regression sample

Overview

PyPI module: N/A
git repository: https://bitbucket.org/arrizza-public/ai-linear-regression
git command: git clone git@bitbucket.org:arrizza-public/ai-linear-regression.git
Verification Report: https://arrizza.com/web-ver/ai-linear-regression-report.html
Version Info:
  • macOS 14.5, Python 3.10
  • Ubuntu 20.04 focal, Python 3.10
  • Ubuntu 22.04 jammy, Python 3.10
  • Ubuntu 24.04 noble, Python 3.10

Summary

This project shows a simple linear regression calculation and then compares it to scipy linear regression.

This site contains the mathematical description for linear regression: https://www.ncl.ac.uk/webtemplate/ask-assets/external/maths-resources/statistics/regression-and-correlation/simple-linear-regression.html

The doc for scipy's linregress function is here: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.linregress.html#scipy.stats.linregress

Additional Python-based regression modules are described here: https://realpython.com/linear-regression-in-python/

How it works

The code in app.py builds two lists, one for the x data and one for the y data. They are generated from a linear equation we already know, so the constants calculated by the linear regression can be compared against the known ones. A match is evidence that the calculations are correct.

Currently the x values run from -5.0 to 5.0 in steps of 1.0, i.e. -5.0, -4.0, -3.0 ... 4.0, 5.0.

self._gen_x_values(data, -5.0, 5.0, increment=1.0)  # in app.py, function _gen_values()

The y values are calculated from the x values using the slope and intercept:

data.slope = 3.0
data.intercept = 5.0
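
As a rough sketch of the data generation described above (not the actual code in app.py; the helper names gen_x_values and gen_y_values are hypothetical), the two lists could be built like this:

```python
def gen_x_values(start, stop, increment):
    """Return evenly spaced x values from start to stop, inclusive."""
    count = round((stop - start) / increment) + 1
    # Multiply rather than repeatedly add, to avoid accumulating float error.
    return [start + i * increment for i in range(count)]


def gen_y_values(x_values, slope, intercept):
    """Return y = slope * x + intercept for each x."""
    return [slope * x + intercept for x in x_values]


x_values = gen_x_values(-5.0, 5.0, increment=1.0)
y_values = gen_y_values(x_values, slope=3.0, intercept=5.0)
print(len(x_values), x_values[0], y_values[0])
```

With these settings the sketch produces 11 points, starting at x = -5.0, y = -10.0.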

To run it

You can run the simple or scipy or both to compare their resulting data. The calculated slope and intercept should match the original slope and intercept (i.e. 3 and 5) when there is no noise in the data.

./doit --tech simple
./doit --tech scipy
./doit --tech all

To see what happens when there is noise, use the --noise switch:

./doit --tech simple --noise 0.1  # adds random +/-0.1 noise to each y value

Note: default noise is 0.0 i.e. no noise
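
One way the noise could be applied is sketched below. This is an assumption for illustration only: the function name add_noise is hypothetical, and the sketch draws from a uniform distribution over [-noise, +noise], which may not be the distribution app.py actually uses.

```python
import random


def add_noise(y_values, noise, rng):
    """Return a copy of y_values with uniform noise in [-noise, +noise] added.

    The uniform distribution is an assumption, not taken from app.py.
    """
    return [y + rng.uniform(-noise, noise) for y in y_values]


rng = random.Random(42)  # fixed seed so repeated runs give the same values
clean = [3.0 * x + 5.0 for x in (-1.0, 0.0, 1.0)]
noisy = add_noise(clean, 0.1, rng)
print(noisy)
```

With noise set to 0.0 the returned values equal the inputs, matching the default behavior noted above.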

Simple linear regression

Run the simple linear regression

./doit --tech simple --noise 0.0
==== linear regression: simple
     noise:    0.00
  -> generated 11 data.x values
  -> generated 11 data.y values
  -> first 15 data values:
     x,y[ 0]:   -5.00,  -10.00
     x,y[ 1]:   -4.00,   -7.00
     x,y[ 2]:   -3.00,   -4.00
     x,y[ 3]:   -2.00,   -1.00
     x,y[ 4]:   -1.00,    2.00
     x,y[ 5]:    0.00,    5.00
     x,y[ 6]:    1.00,    8.00
     x,y[ 7]:    2.00,   11.00
     x,y[ 8]:    3.00,   14.00
     x,y[ 9]:    4.00,   17.00
     x,y[10]:    5.00,   20.00

The x,y data is reported, limited to the first 15 data points.

 x,y[ 0]:   -5.00,  -10.00
# when x == -5.00:
#    y = slope * x + intercept
#    y = 3.0 * -5.00 + 5.0
#    y = -10.00

The results show that the regression slope/intercept exactly match the original slope/intercept i.e. 0% error in both.

  -> simple:  regression results:
     regression     : y = 3.00 * x + 5.00
     original       : y = 3.00 * x + 5.00
     err slope      :    0.00%
     err intercept  :    0.00%

See the class SimpleLinearRegression for the mathematical steps taken to calculate the linear regression.
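
For orientation, here is a minimal sketch of the standard least-squares formulas for simple linear regression (slope = Sxy / Sxx, intercept = mean(y) - slope * mean(x)). It is not a copy of SimpleLinearRegression, which may organize the steps differently; the function name below is hypothetical.

```python
def simple_linear_regression(x_values, y_values):
    """Return (slope, intercept) of the least-squares line through the data."""
    n = len(x_values)
    mean_x = sum(x_values) / n
    mean_y = sum(y_values) / n
    # Sxy: sum of products of the deviations of x and y from their means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(x_values, y_values))
    # Sxx: sum of squared deviations of x from its mean
    sxx = sum((x - mean_x) ** 2 for x in x_values)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


x_values = [float(x) for x in range(-5, 6)]
y_values = [3.0 * x + 5.0 for x in x_values]
slope, intercept = simple_linear_regression(x_values, y_values)
print(f"regression     : y = {slope:.2f} * x + {intercept:.2f}")
```

On the noise-free data used above, the sketch recovers the original slope of 3 and intercept of 5.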

Add Noise

Using --noise shows that the calculation still gives close results when there is noise in the data.

./doit --tech simple --noise 0.1
<skip>
  -> simple:  regression results:
     regression     : y = 3.00 * x + 5.04
     original       : y = 3.00 * x + 5.00
     err slope      :    0.12%
     err intercept  :    0.82%

The calculated slope differs slightly from the original (0.12% error), but it is rounded to two decimal places for display and so it still shows as "3.00".
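
The rounding effect can be seen with a small example. The value 3.0036 is hypothetical, chosen only because it is about 0.12% above 3.0; the real run's unrounded slope is not shown in the output.

```python
slope = 3.0036  # hypothetical unrounded slope, about 0.12% above 3.0

two_digits = f"{slope:.2f}"   # the precision used in the report
four_digits = f"{slope:.4f}"  # more digits reveal the difference
print(two_digits, four_digits)
```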

Now add more noise:

./doit --tech simple --noise 1.5
<skip>
  -> simple:  regression results:
     regression     : y = 2.94 * x + 5.80
     original       : y = 3.00 * x + 5.00
     err slope      :   -1.96%
     err intercept  :   15.90%
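
The reported error values are consistent with a signed relative error, (calculated - original) / original * 100. The sketch below assumes that definition; the function name percent_error is hypothetical and the project's own code may compute it differently.

```python
def percent_error(calculated, original):
    """Signed error of calculated relative to original, in percent."""
    return (calculated - original) / original * 100.0


slope_err = percent_error(2.94, 3.0)
intercept_err = percent_error(5.80, 5.0)
print(f"err slope      : {slope_err:7.2f}%")
print(f"err intercept  : {intercept_err:7.2f}%")
```

Because the inputs here are the rounded values from the display, the sketch gives about -2% and 16% rather than the -1.96% and 15.90% calculated from the unrounded values.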

scipy linear regression

Run the scipy linear regression.

./doit --tech scipy --noise 0.0
<skip>
  -> scipy: regression results:
     regression     : y = 3.00 * x + 5.00
     original       : y = 3.00 * x + 5.00
     err slope      :    0.00%
     err intercept  :    0.00%
     rvalue         : 1.0
     pvalue         : 5.8534851285390365e-90
     std_err        : 0.0

The initial part of the run is identical to "simple", and the results for slope/intercept are identical too. The rvalue, pvalue and std_err are described in the scipy doc (see the link above).
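
A minimal sketch of calling scipy.stats.linregress directly on the same noise-free data is shown below. How app.py wraps this call is not shown here, so the surrounding code is an assumption; the result fields used (slope, intercept, rvalue, pvalue, stderr) are the ones listed in the scipy doc.

```python
from scipy import stats

x_values = [float(x) for x in range(-5, 6)]
y_values = [3.0 * x + 5.0 for x in x_values]

# linregress returns a result object with the fitted line and fit statistics
result = stats.linregress(x_values, y_values)
print(f"regression     : y = {result.slope:.2f} * x + {result.intercept:.2f}")
print(f"rvalue         : {result.rvalue}")
print(f"pvalue         : {result.pvalue}")
print(f"std_err        : {result.stderr}")
```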

Run both

You can run both on the same data to compare and make sure they match:

./doit --tech all --noise 0.0
<skip>
  -> simple:  regression results:
     regression     : y = 3.00 * x + 5.00
     original       : y = 3.00 * x + 5.00
     err slope      :    0.00%
     err intercept  :    0.00%
  -> scipy: regression results:
     regression     : y = 3.00 * x + 5.00
     original       : y = 3.00 * x + 5.00
     err slope      :    0.00%
     err intercept  :    0.00%
     rvalue         : 1.0
     pvalue         : 5.8534851285390365e-90
     std_err        : 0.0

And now with noise:

./doit --tech all --noise 1.5
<skip>
  -> simple:  regression results:
     regression     : y = 2.95 * x + 5.67
     original       : y = 3.00 * x + 5.00
     err slope      :   -1.75%
     err intercept  :   13.47%
  -> scipy: regression results:
     regression     : y = 2.95 * x + 5.67
     original       : y = 3.00 * x + 5.00
     err slope      :   -1.75%
     err intercept  :   13.47%
     rvalue         : 0.9987438408820251
     pvalue         : 5.156252366187247e-13
     std_err        : 0.049293050131861714
     -----------

Note that the error% values for slope and intercept are the same for both methods.

Changing the number of data points

To change the number of data points or the boundaries of the data, modify this line:

self._gen_x_values(data, -5.0, 5.0, increment=0.01)

Running with no noise gives results similar to those above:

./doit --tech all --noise 0.0
<skip>
    -> generated 1001 data.x values
    -> generated 1001 data.y values
<skip>
    -> simple:  regression results:
     regression     : y = 3.00 * x + 5.00
     original       : y = 3.00 * x + 5.00
     err slope      :    0.00%
     err intercept  :    0.00%
  -> scipy: regression results:
     regression     : y = 3.00 * x + 5.00
     original       : y = 3.00 * x + 5.00
     err slope      :    0.00%
     err intercept  :   -0.00%
     rvalue         : 1.0
     pvalue         : 0.0
     std_err        : 0.0

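With noise, the larger data set brings the slope error down (-0.03% versus -1.75% with 11 points) and reduces the std_err value, while the intercept error stays at about the same level: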

./doit --tech all --noise 1.5

 -> simple:  regression results:
     regression     : y = 3.00 * x + 5.75
     original       : y = 3.00 * x + 5.00
     err slope      :   -0.03%
     err intercept  :   14.94%
  -> scipy: regression results:
     regression     : y = 3.00 * x + 5.75
     original       : y = 3.00 * x + 5.00
     err slope      :   -0.03%
     err intercept  :   14.94%
     rvalue         : 0.998766741274143
     pvalue         : 0.0
     std_err        : 0.004716712419855965

With 1001 data points instead of 11, std_err dropped from 0.049293050131861714 to 0.004716712419855965, roughly a factor of 10.
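
The effect of the number of points on std_err can be reproduced with the textbook formula for the standard error of the slope, sqrt(SSE / (n - 2) / Sxx). The sketch below is illustrative only: the helper names are hypothetical, the uniform +/-1.5 noise is an assumption, and because the noise is random the numbers will not match the run above.

```python
import math
import random


def slope_std_err(xs, ys):
    """Standard error of the least-squares slope: sqrt(SSE / (n - 2) / Sxx)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # SSE: sum of squared residuals around the fitted line
    sse = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    return math.sqrt(sse / (n - 2) / sxx)


def noisy_line(count, rng):
    """Return count points on y = 3x + 5 for x in [-5, 5], with assumed noise."""
    step = 10.0 / (count - 1)
    xs = [-5.0 + i * step for i in range(count)]
    ys = [3.0 * x + 5.0 + rng.uniform(-1.5, 1.5) for x in xs]
    return xs, ys


rng = random.Random(7)  # fixed seed for repeatable noise
err_small = slope_std_err(*noisy_line(11, rng))
err_large = slope_std_err(*noisy_line(1001, rng))
print(err_small, err_large)
```

Sxx grows as more points are added over the same x range, so the standard error of the slope shrinks even though the noise level is unchanged.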

- John Arrizza