Linear regression is one of the most frequently used supervised learning techniques. It is useful when the dependent variable is continuous (ratio or interval scale) and there is a linear relationship between the dependent and independent variables. This post is a quick guide to performing linear regression in R and interpreting the model results.

In this example, the ‘longley‘ dataset is used to illustrate linear regression in R. The dataset contains 7 economic variables observed yearly from 1947 to 1962, used to predict the number of people employed each year.

## Load the longley dataset
data("longley")

## Check the data (first 5 rows)
head(longley, 5)
##      GNP.deflator     GNP Unemployed Armed.Forces Population Year Employed
## 1947         83.0 234.289      235.6        159.0    107.608 1947   60.323
## 1948         88.5 259.426      232.5        145.6    108.632 1948   61.122
## 1949         88.2 258.054      368.2        161.6    109.773 1949   60.171
## 1950         89.5 284.599      335.1        165.0    110.929 1950   61.187
## 1951         96.2 328.975      209.9        309.9    112.075 1951   63.221
## Check the structure of the data
str(longley)
## 'data.frame':    16 obs. of  7 variables:
##  $ GNP.deflator: num  83 88.5 88.2 89.5 96.2 ...
##  $ GNP         : num  234 259 258 285 329 ...
##  $ Unemployed  : num  236 232 368 335 210 ...
##  $ Armed.Forces: num  159 146 162 165 310 ...
##  $ Population  : num  108 109 110 111 112 ...
##  $ Year        : int  1947 1948 1949 1950 1951 1952 1953 1954 1955 1956 ...
##  $ Employed    : num  60.3 61.1 60.2 61.2 63.2 ...
## Summarize the data
summary(longley)
##   GNP.deflator         GNP          Unemployed     Armed.Forces  
##  Min.   : 83.00   Min.   :234.3   Min.   :187.0   Min.   :145.6  
##  1st Qu.: 94.53   1st Qu.:317.9   1st Qu.:234.8   1st Qu.:229.8  
##  Median :100.60   Median :381.4   Median :314.4   Median :271.8  
##  Mean   :101.68   Mean   :387.7   Mean   :319.3   Mean   :260.7  
##  3rd Qu.:111.25   3rd Qu.:454.1   3rd Qu.:384.2   3rd Qu.:306.1  
##  Max.   :116.90   Max.   :554.9   Max.   :480.6   Max.   :359.4  
##    Population         Year         Employed    
##  Min.   :107.6   Min.   :1947   Min.   :60.17  
##  1st Qu.:111.8   1st Qu.:1951   1st Qu.:62.71  
##  Median :116.8   Median :1954   Median :65.50  
##  Mean   :117.4   Mean   :1954   Mean   :65.32  
##  3rd Qu.:122.3   3rd Qu.:1958   3rd Qu.:68.29  
##  Max.   :130.1   Max.   :1962   Max.   :70.55

The above output shows the data, its structure, and a summary of the variables. We will now build a regression model using the ‘lm()‘ function in R. The dependent variable in this model will be ‘Employed‘ and the remaining six variables will be the independent variables. The model summary can be printed using the ‘summary()‘ function.

## Fit the regression model to the data
mod <- lm(formula = Employed ~ ., data = longley)

## Model Summary
summary(mod)
## 
## Call:
## lm(formula = Employed ~ ., data = longley)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.41011 -0.15767 -0.02816  0.10155  0.45539 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -3.482e+03  8.904e+02  -3.911 0.003560 ** 
## GNP.deflator  1.506e-02  8.492e-02   0.177 0.863141    
## GNP          -3.582e-02  3.349e-02  -1.070 0.312681    
## Unemployed   -2.020e-02  4.884e-03  -4.136 0.002535 ** 
## Armed.Forces -1.033e-02  2.143e-03  -4.822 0.000944 ***
## Population   -5.110e-02  2.261e-01  -0.226 0.826212    
## Year          1.829e+00  4.555e-01   4.016 0.003037 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3049 on 9 degrees of freedom
## Multiple R-squared:  0.9955, Adjusted R-squared:  0.9925 
## F-statistic: 330.3 on 6 and 9 DF,  p-value: 4.984e-10

Output Analysis

Let’s analyze each component in the model output one by one:

Call:

The call shows the function that was used in R to fit the model. The call is fairly simple: the first argument is the formula, which specifies the dependent and independent variables, and the second argument is the dataset being used.
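One note on the formula: ‘Employed ~ .‘ uses the dot as shorthand for “all remaining columns”. As a minimal sketch, the same model can be fit with every predictor written out explicitly; the two calls should produce identical coefficients (‘mod2‘ is just an illustrative name):

## Equivalent call with all six predictors written out explicitly
mod2 <- lm(Employed ~ GNP.deflator + GNP + Unemployed + Armed.Forces +
             Population + Year, data = longley)
all.equal(coef(mod), coef(mod2))  ## should return TRUE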

Residuals:

Residuals are the next component in the model summary. A residual is the difference between an actual value in the dataset and the value predicted by the model. For the model to be good, the residuals should be roughly normally distributed around zero. From the output, the residual distribution does not look perfectly symmetric (the median is below zero and the maximum is larger in magnitude than the minimum), which suggests some predicted values are noticeably far from the actual values. We can verify this by plotting the residuals (a Q-Q plot, or a histogram with a normal curve).
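A quick visual check uses base R graphics: a Q-Q plot of the residuals and a histogram with a normal curve overlaid. This is a minimal sketch; ‘res‘ is just a convenience variable holding the residuals of the fitted model.

## Visual normality checks for the residuals
res <- residuals(mod)

## Q-Q plot: points should fall close to the line if the residuals are normal
qqnorm(res)
qqline(res)

## Histogram with a matching normal density curve overlaid
hist(res, freq = FALSE, main = "Histogram of residuals", xlab = "Residuals")
curve(dnorm(x, mean = mean(res), sd = sd(res)), add = TRUE)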

Coefficients:

The next component in the model summary is one of the most important components in the model as it gives us the model equation which can be used for predicting future values. The equation for multiple regression models is given by:

y = ß0 + ß1x1 + ß2x2 + … + ßnxn

Where,
y → Dependent variable
ß0 → value for intercept
ß1 → slope for independent variable 1
ß2 → slope for independent variable 2
ßn → slope for independent variable n
x1 → independent variable 1
x2 → independent variable 2
xn → independent variable n

Coefficient – Estimate

The coefficient Estimate is the value of the coefficient to be used in the equation. The coefficient for each independent variable has a direct interpretation: for example, 1.506e-02 for ‘GNP.deflator’ means that for every 1-unit increase in ‘GNP.deflator’, holding the other variables constant, the value of ‘Employed‘ increases by 1.506e-02. Based on the coefficient estimates, the equation for our model is:


Employed = (-3.482e+03) + (1.506e-02 * GNP.deflator) - (3.582e-02 * GNP) - (2.020e-02 * Unemployed) - (1.033e-02 * Armed.Forces) - (5.110e-02 * Population) + (1.829e+00 * Year)
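As a sanity check, the equation can be reproduced from the fitted model itself. The sketch below compares a manual prediction for the first observation (1947) against ‘predict()‘; ‘row1‘ is just an illustrative name:

## Coefficients in the same order as the equation above
coef(mod)

## Manual prediction for the first observation vs. predict()
row1 <- longley[1, ]
sum(coef(mod) * c(1, row1$GNP.deflator, row1$GNP, row1$Unemployed,
                  row1$Armed.Forces, row1$Population, row1$Year))
predict(mod, newdata = row1)  ## the two numbers should match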

Coefficient – Standard Error

The coefficient Standard Error measures the average amount that the coefficient estimate varies from its true value; in other words, it quantifies the uncertainty in the estimate. We want it to be small relative to the estimate for the variable to be a reliable predictor.

Coefficient – t value

The coefficient t value measures how many standard errors the coefficient estimate is away from 0. We want this value to be large in absolute terms so that we can reject the null hypothesis (H0) that there is no relationship between the dependent and independent variables.

Coefficient – Pr(>|t|)

Pr(>|t|) is the p-value computed from the t value. It is used for deciding whether to reject the null hypothesis (H0) stated above. Conventionally, a value below 0.05 (5%) is taken as the cut-off for rejecting H0. In our model, ‘Unemployed‘, ‘Armed.Forces‘, and ‘Year‘ clear this threshold.
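For intuition, both statistics can be recomputed by hand from the coefficient matrix returned by ‘summary()‘. This is a minimal sketch; the degrees of freedom are the 9 residual degrees of freedom shown in the output:

## Coefficient matrix: Estimate, Std. Error, t value, Pr(>|t|)
ctab <- summary(mod)$coefficients

## t value = estimate / standard error
ctab[, "Estimate"] / ctab[, "Std. Error"]

## Two-sided p-value from the t distribution on the residual degrees of freedom
2 * pt(abs(ctab[, "t value"]), df = df.residual(mod), lower.tail = FALSE)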

Residual Standard Error:

This is a measure of the quality of the fit of the regression model. Every linear model has an error term, which is why predictions are never perfectly accurate. The residual standard error is, roughly, the average distance between the actual values and the values predicted by the model. Here it is 0.3049 on 9 degrees of freedom (16 observations minus 7 estimated coefficients, including the intercept).
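The value can be reproduced from the residuals directly (a minimal sketch):

## Residual standard error = sqrt(RSS / residual degrees of freedom)
sqrt(sum(residuals(mod)^2) / df.residual(mod))

## Or extract it straight from the fitted model
sigma(mod)  ## 0.3049, matching the summary output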

Multiple R-squared:

Multiple R-squared measures how well the model fits the data: it gives the percentage of the variance in the dependent (response) variable that is explained by the independent (predictor) variables. For our model, the R-squared value of 0.9955 means that 99.55% of the variance in the ‘Employed’ variable is explained by the six independent variables.

Adjusted R-squared:

Adjusted R-squared is the measure to consider when the model has more than one independent variable. It adjusts R-squared for the number of predictors, penalizing variables that add little explanatory power, and is therefore the preferred measure of model goodness here. For our model it is 0.9925, very close to the unadjusted value.
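Both values can be extracted from the summary object, and R-squared can be recomputed by hand as 1 minus the ratio of the residual sum of squares to the total sum of squares (a minimal sketch; ‘s‘ is just an illustrative name):

s <- summary(mod)
s$r.squared      ## 0.9955
s$adj.r.squared  ## 0.9925

## R-squared by hand: 1 - RSS / TSS
1 - sum(residuals(mod)^2) / sum((longley$Employed - mean(longley$Employed))^2)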

F-Statistic:

The F-statistic tests whether there is any relationship at all between the independent (predictor) variables and the dependent (response) variable. A value sufficiently greater than 1 is evidence against the null hypothesis (H0: there is no relationship between ‘Employed‘ and the independent variables); how much greater it needs to be depends on the number of observations and predictors, and with only 16 data points we want it comfortably above 1. For our model the F-statistic is 330.3, far above that bar. The p-value reported next to the F-statistic is interpreted the same way as Pr(>|t|) in the coefficients output: since 4.984e-10 < 0.05, we reject the null hypothesis (H0).
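The F-statistic and its p-value can also be pulled from the summary object (a minimal sketch; ‘pf()‘ gives the upper-tail probability of the F distribution):

s <- summary(mod)
s$fstatistic  ## 330.3 on 6 and 9 degrees of freedom

## p-value for the overall F test, matching 4.984e-10 in the output
pf(s$fstatistic["value"], s$fstatistic["numdf"], s$fstatistic["dendf"],
   lower.tail = FALSE)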

Relationship between R-squared and p-value in Regression

When evaluating model results, these two statistics are the most frequently used and the most often confused. To clear up the confusion: there is no fixed relationship between the two.

R-squared tells how much of the variation in the response variable is explained by the predictor variables, while the p-value tells whether the predictors used in the model explain the response variable at all. If the p-value < 0.05 (at the 95% confidence level), the model is considered statistically significant.

Based on this, there are four possible combinations of the two:

  1. Low R-squared and low p-value (p-value <= 0.05): The model doesn’t explain much of the variation in the response variable, but it is significant, so it is still better than having no model at all.
  2. Low R-squared and high p-value (p-value > 0.05): The model doesn’t explain much of the variation in the data and is not significant. We should discard such a model; this is the worst scenario.
  3. High R-squared and low p-value: The model explains much of the variation in the data and is also significant. This is the best of the four scenarios, and the model is considered good in this case.
  4. High R-squared and high p-value: The model explains much of the variance in the data, but it is not significant. We should not use such a model for predictions.