
Statistics - Regression Analysis


This article explains regression analysis in statistics.

What is Regression Analysis? #

Regression analysis is a statistical technique that estimates the effect of one or more independent variables on a dependent variable.

Variables in Regression Analysis #

The Dependent Variable (y) is the variable being affected, also known as the Response Variable or Outcome Variable. It is the variable the model predicts, influenced by the other variables.

The Independent Variable (x) is the variable that influences the outcome, also referred to as the Explanatory Variable or Predictor Variable. Independent variables affect the dependent variable and are used to construct the prediction model.

Regression Analysis Methods Based on the Number of Variables #

The approach to regression analysis varies based on the number of variables.

If there is one independent variable, simple linear regression is used. If there are two or more independent variables, multiple linear regression applies.

1. Simple Linear Regression #

This statistical technique estimates the effect of a single independent variable on a dependent variable. The regression line chosen is the one that minimizes the difference (residual) between the predicted values and the actual data. Among all candidate lines, the regression line is the one with the smallest Residual Sum of Squares (RSS).

This is achieved through the Ordinary Least Squares method.

Ordinary Least Squares (OLS): a method that squares each residual, sums the squares, and chooses the line that minimizes this sum, i.e., the line with the smallest residual sum of squares.
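As a minimal sketch (the data points below are made up for illustration), the closed-form OLS solution for simple linear regression can be computed directly with NumPy:

```python
import numpy as np

# Hypothetical sample data: x is the independent variable, y the dependent variable.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# OLS for simple linear regression has a closed form:
# slope = cov(x, y) / var(x), intercept = mean(y) - slope * mean(x).
slope = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
intercept = y.mean() - slope * x.mean()

# Residual Sum of Squares (RSS): the quantity OLS minimizes.
residuals = y - (intercept + slope * x)
rss = np.sum(residuals ** 2)

print(f"y = {intercept:.3f} + {slope:.3f} * x, RSS = {rss:.3f}")
```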

2. Multiple Linear Regression #

This statistical technique estimates the effect of two or more independent variables on one dependent variable. In multiple linear regression, it is important to test the significance of each regression coefficient, because the final model should be built from the combination of selected variables whose regression coefficients are all statistically verified.

The significance of each regression coefficient can be checked through its t-statistic.
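As an illustrative sketch, assuming synthetic data with two predictors, statsmodels reports each coefficient's t-statistic and p-value in its summary output:

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical data: two independent variables (x1, x2) and one dependent variable y.
rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = rng.normal(size=100)
y = 1.5 + 2.0 * x1 - 0.5 * x2 + rng.normal(scale=0.3, size=100)

X = sm.add_constant(np.column_stack([x1, x2]))  # adds the intercept column
model = sm.OLS(y, X).fit()

# The summary lists each coefficient's t-statistic and p-value,
# which is how the significance of regression coefficients is checked.
print(model.summary())
```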

Multicollinearity in Multiple Linear Regression #

Multicollinearity refers to the phenomenon in regression analysis where there is a strong correlation between independent variables, occurring when one independent variable can be predicted from the others.

Multicollinearity complicates the accurate estimation of each independent variable’s regression coefficient. Furthermore, it prevents the regression coefficients of each independent variable from correctly explaining their impact on the dependent variable.

Methods to Test for Multicollinearity #

Variance Inflation Factor (VIF):

The Variance Inflation Factor indicates how much the variance of an estimated regression coefficient is inflated by correlation among the independent variables; a large value signals multicollinearity. It is computed from the 𝑅² obtained by linearly regressing each independent variable on the others: VIF = 1 / (1 − 𝑅²). Generally, if the VIF is greater than 4, it’s judged that multicollinearity exists, and if greater than 10, it’s considered to be a serious problem.
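As a hedged example on synthetic predictors, where x2 is deliberately constructed to be nearly collinear with x1, the VIF can be computed with statsmodels’ variance_inflation_factor:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Hypothetical predictors: x2 is nearly a multiple of x1,
# so multicollinearity should inflate their VIFs.
rng = np.random.default_rng(1)
x1 = rng.normal(size=100)
x2 = 2.0 * x1 + rng.normal(scale=0.1, size=100)  # nearly collinear with x1
x3 = rng.normal(size=100)                        # independent of the others

X = sm.add_constant(np.column_stack([x1, x2, x3]))

# VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing
# predictor j on all the other predictors (index 0 is the constant).
for j, name in enumerate(["x1", "x2", "x3"], start=1):
    print(name, variance_inflation_factor(X, j))
```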

Considerations in Regression Analysis #

When conducting regression analysis, there are three main points to consider:

1. Are the regression coefficients significant? #

A regression coefficient is considered statistically significant if the p-value of its t-statistic is less than 0.05. This means the coefficient has a significant impact on the dependent variable.
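As a small sketch on synthetic data (the values and coefficients here are made up), the pvalues attribute of a fitted statsmodels OLS result makes this check direct:

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical fit: the third coefficient's true value is 0,
# so its p-value will typically not pass the 0.05 threshold.
rng = np.random.default_rng(3)
X = sm.add_constant(rng.normal(size=(100, 2)))
y = X @ np.array([1.0, 2.0, 0.0]) + rng.normal(scale=0.5, size=100)

model = sm.OLS(y, X).fit()
print(model.pvalues < 0.05)  # True where the coefficient is significant at the 5% level
```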

2. How much explanatory power does the model have? #

To check how much explanatory power the model has, the Coefficient of Determination (𝑅²) must be reviewed.

Coefficient of Determination (𝑅²) #

The coefficient of determination is a value between 0 and 1; the closer it is to 1, the better the model explains the variation in the dependent variable. A high coefficient of determination indicates high predictive power of the model.
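As a minimal sketch with made-up actual and predicted values, 𝑅² can be computed from its definition, 1 − RSS/TSS:

```python
import numpy as np

# R^2 = 1 - RSS / TSS, where TSS is the total sum of squares of y.
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])        # hypothetical actual values
y_hat = np.array([2.0, 4.0, 6.0, 8.0, 10.0])   # hypothetical model predictions

rss = np.sum((y - y_hat) ** 2)                 # residual sum of squares
tss = np.sum((y - y.mean()) ** 2)              # total sum of squares
r_squared = 1 - rss / tss
print(f"R^2 = {r_squared:.3f}")  # close to 1 -> the model explains most of the variation
```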

3. Does the model fit the data well? #

To determine whether the model fits the data well, residuals are plotted and regression diagnostics are performed. Residuals, the differences between actual values and the model’s predicted values, are reviewed visually to check how well the model fits the data. Ideally, residuals follow a normal distribution without any specific patterns or trends, and homoscedasticity (constant variance) of the residuals must also be confirmed. Outliers and influential data points should be reviewed and, if necessary, removed or adjusted to check the model’s stability.
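As an illustrative sketch on synthetic data, a residuals-versus-fitted plot is one common diagnostic: ideally it shows a patternless band around zero with roughly constant spread.

```python
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt

# Hypothetical fit to diagnose.
rng = np.random.default_rng(2)
x = rng.uniform(0, 10, size=100)
y = 3.0 + 1.2 * x + rng.normal(scale=1.0, size=100)

X = sm.add_constant(x)
model = sm.OLS(y, X).fit()

# Residuals vs. fitted values: no trend suggests a good fit;
# constant spread suggests homoscedasticity.
plt.scatter(model.fittedvalues, model.resid)
plt.axhline(0, linestyle="--")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.show()
```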

Through these reviews, the reliability of the regression analysis and the model’s fit can be assessed.