Introduction to regression analysis
Are you looking to understand the power of regression analysis and how it can help you better understand relationships between variables? In this tutorial, I will show you how to do simple linear and multiple linear regression analyses with SigmaPlot. SigmaPlot includes many statistical methods and hundreds of regression equations to choose from, and you can add your own customized regression equation if needed. Hopefully, this tutorial will help you understand how regression analysis works and how you can apply it to your research.
Regression analysis is invaluable for gaining insights and making better decisions based on available data while examining complex relationships.
Regression analysis has four main uses: description, estimation, prediction, and control. It describes the relationship between dependent and independent variables, allows for estimation of the dependent variable based on observed independent variables, predicts outcomes and changes in the dependent variable based on their relationship, and controls the effect of one or more independent variables while examining the relationship between one independent variable and the dependent variable.
Types of regression analysis
There are many types of regression analysis, including:
These are some of the most commonly used regression analysis techniques, but many others can be used for specific applications or purposes. No matter the type, all forms of regression analysis investigate how one or more independent variables influence a dependent variable.
Linear regression models, like simple linear and multiple linear, are the most common. But nonlinear regression analysis is often used for more complicated datasets in which the connection between dependent and independent variables is not linear.
This tutorial will demonstrate how to do simple linear and multiple linear regression in SigmaPlot. We will use example data about housing prices from an article about regression analysis with Excel. Hopefully, you will pick up a few tips and tricks and see how easy this is with SigmaPlot's rich feature set.
For the simple linear regression, we will be using the equation:
y = a*x + b
And for the multiple linear regression, we will be using the equation (two independent variables):
y = a1*x1 + a2*x2 + b
y is the dependent variable, and x, x1, and x2 are the independent variables.
By doing a regression analysis, we can estimate the unknown coefficients in the equations above (b and a, or b, a1, and a2) and then calculate the expected value of y (the dependent variable) for any given values of the independent variables.
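To make the idea of "finding the unknown coefficients" concrete, here is a minimal Python sketch of an ordinary least-squares fit for the simple case. It assumes the line is written as y = a*x + b, with a as the slope and b as the intercept. The numbers are hypothetical and deliberately noise-free so the result is easy to check by hand; they are not the tutorial's data set, and this is not how SigmaPlot computes its fit internally.

```python
# Hypothetical, noise-free data (NOT the tutorial's data set):
# square footage and selling price in thousands of dollars.
square_feet = [1000, 1500, 2000, 2500, 3000]
price = [150, 200, 250, 300, 350]


def fit_line(x, y):
    """Return (a, b) that minimise the squared residuals of y = a*x + b."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-deviations and sum of squared deviations of x.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    a = sxy / sxx            # slope
    b = mean_y - a * mean_x  # intercept
    return a, b


a, b = fit_line(square_feet, price)

# Expected price for a house size that is not in the data.
predicted = a * 1800 + b
print(f"a = {a:.3f}, b = {b:.1f}, predicted price at 1800 sq ft = {predicted:.1f}")
```

With real data the points will not lie exactly on a line, and the same formulas then give the line that minimises the sum of squared vertical distances to the points.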
What is the purpose of regression analysis?
Regression is a statistical technique utilized to determine the relationships between variables in a dataset, allowing for an evaluation of any connections’ strength and statistical significance. It can also be employed to forecast future outcomes based on past occurrences.
Why is it called regression?
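The name goes back to Francis Galton, who observed in the nineteenth century that the children of unusually tall or short parents tend to be closer to average height than their parents, a tendency he called "regression" towards the mean. The term stuck for the method used to describe it: finding the line of best fit through the data, often referred to as a “regression line”. Regression analysis aims to identify patterns in the data and use them to make predictions about future outcomes.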
Which conditions must be satisfied for regression models to work properly?
Regression analysis is simply a calculation carried out on isolated data. The interpretation of a regression’s output as a statistically meaningful quantity that indicates real-world relationships requires researchers to make various classical assumptions, such as:
What mistakes do people make when working with regression analysis?
It’s important to remember that a correlation between two things doesn’t necessarily mean that one causes the other. Confusing correlation with causation is a common mistake. With house prices, for example, it is tempting to conclude from a correlation between square footage and selling price that a larger house will always command a higher price. That is a causality error, because other factors such as the location, age, and condition of the property also affect the selling price. So always be wary of making causal claims based solely on correlation; it’s not always as simple as it seems!
Avoid examining every variable available all at once. Doing so may result in the identification of nonexistent relationships. This concept is similar to flipping a coin. If you keep doing it enough times, you will eventually find patterns that are not real, such as a set of consecutive heads.
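The coin-flipping point can be illustrated with a small simulation. The sketch below is an assumption-laden toy, not part of the SigmaPlot workflow: it generates a random "dependent variable" and 200 random candidate variables that are unrelated to it by construction, then counts how many of them nevertheless pass a conventional 5% significance threshold for the correlation coefficient. The sample size, the number of candidates, and the seed are arbitrary choices.

```python
import random


def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / (sxx * syy) ** 0.5


random.seed(1)       # arbitrary seed, only for reproducibility
n_obs = 15           # observations per variable
n_candidates = 200   # unrelated candidate variables to screen
r_critical = 0.514   # approx. two-sided 5% critical |r| for 15 observations

# A random "dependent variable" with no real relationship to anything.
y = [random.gauss(0, 1) for _ in range(n_obs)]

false_hits = 0
for _ in range(n_candidates):
    x = [random.gauss(0, 1) for _ in range(n_obs)]  # pure noise
    if abs(pearson_r(x, y)) > r_critical:
        false_hits += 1

print(f"{false_hits} of {n_candidates} unrelated variables look 'significant'")
```

At a 5% threshold, roughly one in twenty unrelated variables is expected to look significant by chance, which is why screening many variables at once calls for caution.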
Take caution when gathering data, considering how it is collected and if you can trust the data.
It is important not to disregard the error term, as this can lead to a false sense of certainty about the analysed relationships. A regression model may explain 90% of the variation in the dependent variable, but the results are inherently uncertain, and the remaining 10% should not be overlooked.
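One common way to put a number on the "explained" share is the coefficient of determination, R². The notation below is introduced here for illustration and is not taken from a SigmaPlot report: y_i are the observed values, ŷ_i the values predicted by the model, ȳ the mean of the observed values, and e_i the error (residual) for observation i.

```latex
y_i = \hat{y}_i + e_i,
\qquad
R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}
    = 1 - \frac{\sum_i e_i^2}{\sum_i (y_i - \bar{y})^2}
```

Read this way, an R² of 0.9 means that the squared residuals still add up to 10% of the total squared variation around the mean, so individual predictions can miss by a noticeable margin.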
It’s important to trust your instincts and judgement. Consider whether the results align with your prior understanding of the situation. If anything seems off, question whether it’s due to incorrect data or a significant error. Pairing any regression analysis with observations is crucial to get the full picture. The best scientists examine both the data and real-world observations.
Regression analysis example data
In this tutorial, we will use data from a course on simple linear and multiple regression at Saint Leo University (link to original article). The (fictional) data set contains the selling price, the square footage, the number of bedrooms, and the age (in years) of houses sold in a neighbourhood in the past six months.
Our task is to find a model that predicts the selling price (dependent variable) based on the independent variables of square footage, number of bedrooms and age.
By doing regression analysis on this dataset, we will try to answer the questions:
Which independent variables will have the biggest effect on the selling price? Will it be the square footage, the number of bedrooms, or the age of the house? Will we get a better fit if we include all three independent variables, and if not, which two independent variables should we pick?
Regression analysis is a technique used to mathematically assess which variables have an influence. It can answer questions such as: Which characteristics are the most influential? Which can be disregarded? How do these characteristics interact with each other? And, probably most importantly, how confident can we be in all of these findings?
Regression analysis with SigmaPlot
1. Importing the data to SigmaPlot
In this case, I only had access to the Saint Leo University PDF document and no access to the data file. To avoid wasting time entering the data manually into a SigmaPlot worksheet, I used our PDF management software, FineReader PDF, which has a screenshot reader tool that can extract data tables directly from any screenshot to Microsoft Excel.
There are probably free tools out there doing the same. Try googling “screen capture data tables”. A decent screen-capturing tool is necessary when gathering data from different (old) sources. I can also recommend the screen-capturing tool Snagit for grabbing text from documents and images.
SigmaPlot plays well with Microsoft Excel, so having the data in my Excel sheet, I can easily copy-paste it into my SigmaPlot worksheet. This, however, pastes the column titles in the first row and not in SigmaPlot’s column header/titles. To move them up into the column titles field:
Another way of doing this is by importing the Excel file to your SigmaPlot project.
2. Visualise your data in SigmaPlot
Visualizing your data is a crucial step in understanding and interpreting your results. With SigmaPlot, you have a powerful tool that can help you visualize your data effectively, whether you want to create simple scatter plots, histograms, or complex 3D surfaces.
Let’s visualise the square footage and Age vs Price.
Please note that you can double-click any element on the graph page to edit it. For example, double-click the title to change the title text of each graph, double-click an axis to change labels and tick marks, or click and drag the legend boxes to place them underneath the “X Data” text.
3. Analyse your data and find the best subset for your regression
Finding the best subset of independent variables for a regression analysis is an important step in ensuring the accuracy and robustness of your results. In our case, we have three candidate independent variables: square footage, number of bedrooms, and age. Which of these correlates most strongly with the price, and are they all relevant to our study?
SigmaPlot provides a range of diagnostic tools that allow you to identify influential observations and check the assumptions of your regression model. These tools can help you to refine your analysis and improve the robustness of your results. We will use the “Best Subset Regression” analysis tool in this case.
Reading the report, we find that the best subset for our regression data is Square footage and Age.
The Best Subset report shows that we do not get a better regression model by including the number of bedrooms. R-square is the same with two and with three independent variables, but Adjusted R-square is higher when using only the two variables Square footage and Age.
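Why Adjusted R-square can drop when R-square stays the same follows from its usual definition, which penalises every additional independent variable. The Python sketch below uses that standard formula with hypothetical values for R-square and the number of observations; they are not the figures from the SigmaPlot report.

```python
def adjusted_r_squared(r_squared, n_obs, n_predictors):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)


# Hypothetical values: the same R-square for both models, 20 observations.
r_squared = 0.90
n_obs = 20

adj_two = adjusted_r_squared(r_squared, n_obs, n_predictors=2)
adj_three = adjusted_r_squared(r_squared, n_obs, n_predictors=3)

print(f"Adjusted R-square with 2 predictors: {adj_two:.4f}")
print(f"Adjusted R-square with 3 predictors: {adj_three:.4f}")
```

With equal R-square, the model with fewer independent variables always gets the higher adjusted value, which matches the pattern described in the report above.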
4. Simple linear regression using SigmaPlot’s Regression Wizard
Simple linear regression is a technique in which the relationship between one dependent and one independent variable is modelled as a straight line.
The simple linear model is expressed using the following equation:
Y = mX + b
where m is the slope of the line and b is the intercept.
To perform a simple linear regression using SigmaPlot and the regression wizard, follow these steps:
SigmaPlot will create a scatter plot of your data with the regression fit line and, if you select them, 95% confidence and prediction bands. If you chose to have SigmaPlot create a report, you will also find a Regression report sheet with all statistical test results for your analysis.
This is a basic overview of performing a simple linear regression in SigmaPlot using the Regression Wizard. Please refer to the SigmaPlot User’s Guide or help file for more detailed information and options.
5. Multiple linear regression with SigmaPlot
In numerous situations, a single variable may not be adequate to account for variation in Y. A multivariable linear regression can then be implemented to evaluate the effect of multiple variables on the result.
In a multivariable regression model, the dependent variable Y is described as a linear combination of the independent variables X1 to Xn, given by: Y = a + b1*X1 + b2*X2 + … + bn*Xn.
Multiple linear regression is essentially the same as simple linear regression, except that several independent variables are used in the model. It relies on the same conditions as the simple linear model, with one addition: the independent variables should be only minimally correlated with each other. If the independent variables are strongly correlated (multicollinearity), it becomes difficult to measure accurately how each one relates to the dependent variable.
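A quick way to check this condition before fitting is to look at the pairwise correlations between the independent variables. The Python sketch below does this for hypothetical columns (not the tutorial's data) and also computes the variance inflation factor (VIF), which for a model with exactly two independent variables reduces to 1 / (1 - r²). The threshold of about 10 mentioned in the comment is a common rule of thumb, not a SigmaPlot setting.

```python
def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / (sxx * syy) ** 0.5


def vif_two_predictors(r):
    """VIF for a model with exactly two predictors whose correlation is r."""
    return 1 / (1 - r ** 2)


# Hypothetical predictor columns (NOT the tutorial's data set).
square_feet = [1000, 1500, 2000, 2500, 3000, 3500]
bedrooms = [2, 3, 3, 4, 4, 5]
age = [30, 5, 22, 12, 40, 8]

r_sqft_bedrooms = pearson_r(square_feet, bedrooms)
r_sqft_age = pearson_r(square_feet, age)

# Rule of thumb: a VIF above roughly 10 signals problematic collinearity.
print(f"sq ft vs bedrooms: r = {r_sqft_bedrooms:.2f}, "
      f"VIF = {vif_two_predictors(r_sqft_bedrooms):.1f}")
print(f"sq ft vs age:      r = {r_sqft_age:.2f}, "
      f"VIF = {vif_two_predictors(r_sqft_age):.1f}")
```

In this made-up example, square footage and bedrooms move together almost perfectly, so a model containing both would struggle to separate their effects, while square footage and age are nearly uncorrelated.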
The Subset Regression analysis for our data showed that the best independent variables to use were Square footage and Age, so we will use these two variables for our multiple regression analysis with SigmaPlot in the following.
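For readers who want to see what such a fit does under the hood, here is a minimal Python sketch of a multiple linear regression of the form Y = a + b1*X1 + b2*X2, solved through the normal equations. The helper names `solve_linear_system` and `fit_multiple` are made up for this example, and the data are hypothetical and noise-free (generated from a = 50, b1 = 0.1, b2 = -0.5) so the recovered coefficients can be checked. SigmaPlot's own numerical method may differ.

```python
def solve_linear_system(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        # Move the row with the largest entry in this column to the pivot spot.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    # Back substitution.
    solution = [0.0] * n
    for row in range(n - 1, -1, -1):
        known = sum(aug[row][k] * solution[k] for k in range(row + 1, n))
        solution[row] = (aug[row][n] - known) / aug[row][row]
    return solution


def fit_multiple(predictors, y):
    """Least-squares fit of y = a + b1*x1 + ... + bn*xn (normal equations)."""
    # Design matrix rows: a leading 1 for the intercept, then the predictors.
    rows = [[1.0] + list(values) for values in zip(*predictors)]
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(p)]
    return solve_linear_system(xtx, xty)


# Hypothetical, noise-free data (NOT the tutorial's data set).
square_feet = [1000, 1500, 2000, 2500, 3000, 3500]
age = [30, 5, 22, 12, 40, 8]
price = [135.0, 197.5, 239.0, 294.0, 330.0, 396.0]  # 50 + 0.1*sqft - 0.5*age

a, b1, b2 = fit_multiple([square_feet, age], price)

# Expected price for a 2200 sq ft house that is 10 years old.
predicted = a + b1 * 2200 + b2 * 10
print(f"a = {a:.2f}, b1 = {b1:.4f}, b2 = {b2:.3f}, predicted = {predicted:.1f}")
```

The normal equations keep the sketch short; for larger or nearly collinear data sets, a dedicated least-squares routine from a numerical library is usually the more stable choice.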
Regression analysis using SigmaPlot can provide valuable insights into the relationship between two or more variables. A key takeaway is that the results of the regression model can be used to predict values of the dependent variable from the values of the independent variables.
Additionally, the coefficient values and p-values from the regression analysis can be used to determine the significance of each independent variable in explaining the variability in the dependent variable. It is important to carefully assess the assumptions of linearity, homoscedasticity, and normality and to consider transforming the variables or using non-linear regression methods if these assumptions are violated.