R-squared (written r2 here) is the coefficient of determination of a regression model. It measures how well the model's predictions fit the observed data and is calculated as the variance explained by the model divided by the total variance. A higher value means the fitted line accounts for more of the variation in the data, so the points lie closer to it. A lower value means the points are scattered more widely around the line. On its own, the coefficient does not determine the goodness of a model.
Regression Model
Linear regression is a machine learning model that predicts a dependent value from one or more input factors, based on a given dataset. It is widely used because it is convenient, easy to use and understand, and often accurate enough for data with a roughly linear trend. The model works by drawing the line that best fits the data points in the graph, usually the one that minimizes the sum of squared vertical distances between the points and the line. The drawn line is called the regression line or best fit line. It estimates the average value of the dependent variable for each input and generally passes through the middle of the data. This line is then used to predict an unknown future value. The fit may be a straight line or, if polynomial terms are added to the equation, a curve. Its position is determined by the weights (coefficients) given to the different factors that affect the dependent data.
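As a rough illustration of how such a line can be drawn, here is a minimal sketch of a least-squares fit for a single input factor. The function name fit_line and the data points are made up for this example and are not taken from any particular library.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums around the means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up data that roughly follows y = 2x
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

slope, intercept = fit_line(xs, ys)
next_value = slope * 6 + intercept  # predict at an unseen input, x = 6
print(f"slope={slope:.2f}, intercept={intercept:.2f}, prediction={next_value:.2f}")
```

For this data the fitted slope is about 1.99 and the intercept about 0.05, so the predicted value at x = 6 is about 11.99.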
Overfit and Underfit
A model overfits when the fitted line bends to pass through, or very near, every data point. This is not a good fit because datasets usually contain noisy points and outliers, and a line that follows all of them captures the noise rather than the underlying trend, making predictions on new data inaccurate. A model underfits when the line is too simple to follow the trend and stays far from many of the data points. Underfitting also hurts predictions, although a loose fit does not by itself mean that the model is useless. A good prediction needs a line placed between these two extremes. Overfitting and underfitting do not settle the goodness of the model on their own, but they have a big effect on the results.
The Determination Coefficient
The determination coefficient is useful when measuring the accuracy of the model. It is calculated as the variance explained by the model divided by the total variance, which is the same as 1 minus the ratio of the residual sum of squares to the total sum of squares. The value, commonly called r2, describes how closely the data points follow the fit line. It is lower when the points are scattered far from the line and higher when they lie close to it. For a least-squares line with an intercept, measured on the data used to fit it, the value falls on a scale of 0 to 100%. 0% means the line explains none of the variation and does no better than always predicting the mean, while 100% means every data point lies exactly on the line. The value does not determine the real accuracy of the model, because noise, outliers, and biases in the dataset affect it. A lower determination coefficient does not mean the model is bad, and a higher value does not mean the model is accurate. No single value is optimal. For real data, a sound model usually lands somewhere between 0% and 100% rather than at either extreme.
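The calculation can be sketched in a few lines. This is a minimal example, assuming the common definition r2 = 1 - SS_res / SS_tot. The function name r_squared, the data, and the line coefficients 1.99 and 0.05 are made up for illustration.

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    # Variation left unexplained by the predictions
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    # Total variation of the observations around their mean
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1 - ss_res / ss_tot


observed = [2.1, 3.9, 6.2, 7.8, 10.1]
mean_obs = sum(observed) / len(observed)

# Predictions that match every observation exactly
perfect = r_squared(observed, observed)

# Predictions that ignore the input and always return the mean
mean_only = r_squared(observed, [mean_obs] * len(observed))

# Predictions from an assumed fit line y = 1.99x + 0.05 at x = 1..5
line_preds = [1.99 * x + 0.05 for x in [1, 2, 3, 4, 5]]
close_fit = r_squared(observed, line_preds)

print(perfect, mean_only, round(close_fit, 4))
```

The first case gives 100%, the second gives 0%, and the line that passes close to the points but through none of them still scores above 99%. The coefficient therefore reflects how near the points are to the line, not how many points the line touches.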
Lower r2 Value
A lower r-squared value means that the data points are scattered widely around the best fit line. This may result in a bad prediction model because the error is high. Since the regression model works best when the fitting line passes close to the points, a low r-squared value may indicate an inaccurate regression model. However, this is not always the case. If the prediction errors are small in absolute terms, or if the data is inherently noisy because many unmeasured factors influence the outcome, the model can still capture a real trend and provide good results. Even if the value is less than 10%, the resultant model can be quite useful and informative.
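The following sketch shows one way a low r2 can coexist with a useful model. It is a deliberately artificial example: the data follows y = 1 + 0.5x plus a large hand-picked noise pattern that is arranged to cancel out, so the fit recovers the underlying trend exactly. Real noise would not cancel this neatly. The helper names fit_line and r_squared are illustrative.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1 - ss_res / ss_tot


xs = list(range(8))
# Hand-picked noise: large, sums to zero, and symmetric around the middle,
# so it does not tilt the fitted line (an assumption made for this demo)
noise = [4, -4, -4, 4, 4, -4, -4, 4]
ys = [1 + 0.5 * x + e for x, e in zip(xs, noise)]

slope, intercept = fit_line(xs, ys)
r2 = r_squared(ys, [slope * x + intercept for x in xs])
print(f"slope={slope}, intercept={intercept}, r2={r2:.3f}")
```

Here the fitted slope and intercept match the underlying trend (0.5 and 1), yet r2 is below 8% because the scatter around the line is large compared with the trend itself.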
Higher r2 Value
A higher r2 value means that the data points lie close to the fit line. While this generally suggests that the regression model is accurate and will provide good results, overfitting may ruin this. If the line has been bent toward noisy or biased data points, it will make errors in prediction. Hence, it is necessary to check the line and ensure that it follows the underlying trend. Increasing the coefficient value should not be the target while creating the model. Deliberately forcing a high r-squared value usually means adding polynomial terms so that the line becomes a curve. That adds complexity and ruins the simplicity of the regression model. For complex datasets, we recommend not using linear regression but switching to a different machine learning algorithm. Every algorithm has its capabilities and limitations.
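To make the overfitting risk concrete, the sketch below compares a straight line with a polynomial forced through every training point. The polynomial is built with Lagrange interpolation, which is a stand-in chosen for brevity rather than a regression method. All data is made up, and the new points are placed by hand on the trend y = x.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def interpolate(xs, ys, x):
    """Evaluate the Lagrange polynomial through all (xs, ys) at x."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total


def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1 - ss_res / ss_tot


# Noisy training points around y = x, and new points exactly on y = x
train_x, train_y = [0, 1, 2, 3, 4], [0.1, 0.9, 2.3, 2.8, 4.2]
new_x, new_y = [0.5, 1.5, 2.5, 3.5], [0.5, 1.5, 2.5, 3.5]

slope, intercept = fit_line(train_x, train_y)

r2_line_train = r_squared(train_y, [slope * x + intercept for x in train_x])
r2_line_new = r_squared(new_y, [slope * x + intercept for x in new_x])
r2_curve_train = r_squared(train_y, [interpolate(train_x, train_y, x) for x in train_x])
r2_curve_new = r_squared(new_y, [interpolate(train_x, train_y, x) for x in new_x])

print(f"line:  train={r2_line_train:.3f}, new={r2_line_new:.3f}")
print(f"curve: train={r2_curve_train:.3f}, new={r2_curve_new:.3f}")
```

On this data the curve reaches 100% on the training points while the line reaches about 98%, but on the new points the line scores higher (about 99.7% against about 94.6%). The higher training value came from following the noise, not from a better model.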
The Inference
We should never try to increase or decrease the r-squared value intentionally. Doing so raises the chance of errors and leads to an even worse prediction model. The effort should go into building the best fit line that follows the points as closely as the data allows. Even if the points lie far from it, that does not mean the model is useless. Some researchers routinely work with low r-squared values. They look for the reason behind the low coefficient, and this helps them in their research. Hence, a low determination coefficient does not mean that your efforts were useless. Rather, you should trust your results and find out the reason for the low value.
Conclusion
A high r-squared value means that the data points lie close to the best fit line in the regression model. This does not ensure that the model is accurate. A biased dataset may result in an inaccurate model even if the errors are small. A low r-squared value is not ideal either, since the model then explains little of the variation in the dataset. However, a low r-squared value is better than a high yet biased one. The r-squared value is calculated as the variance explained divided by the total variance. For a least-squares fit, this value ranges from 0% to 100%. A value of 0% means that the best fit line explains none of the variation, while a value of 100% means that all the points lie on the best fit line. For real data, a sound determination coefficient usually lies somewhere between 0% and 100%, not at the extremes.