**By Data Tricks, 28 May 2020**

For this tutorial we’re going to use some House Price data from Kaggle. You’ll need to download both the train.csv and test.csv files from here.

Once you’ve downloaded the files, read them into R as follows:

    rm(list = ls())
    train <- read.csv("C:/Users/Me/Desktop/train.csv")
    test <- read.csv("C:/Users/Me/Desktop/test.csv")

Remember to change the file path to where you’ve saved the files. You can type *str(train)* to see a list of the variables and their types, and you should see that there are a total of 81 variables.

A linear regression models the relationship between a dependent variable (that is the variable you are trying to predict) and one or more independent variables (the variables you are using to make the prediction). It therefore follows that one of the first steps in the process of linear regression should be to check the correlation between the dependent variable and potential independent variables. The process of variable selection can be an iterative one – once a linear regression has been carried out we can go back and tweak our variable selection.

Check the correlation between all numeric variables in the dataset.

cor(train[,unlist(lapply(train, is.numeric))])

**Tip!** The command *unlist(lapply(train, is.numeric))* returns a logical vector with one TRUE or FALSE value per column of the dataframe, according to whether that column is numeric. If we wanted to do a more thorough linear regression, we would first look into transforming the non-numeric variables into numeric variables in order to include them in the linear regression, but for the purposes of this tutorial we will ignore these variables for now.
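To see what that logical vector looks like, here is a minimal sketch on a tiny made-up dataframe. The name `toy` and its values are assumptions for illustration only; `sapply` is used as a shorter equivalent of `unlist(lapply(...))`.

```r
# Toy dataframe standing in for train (values are invented)
toy <- data.frame(
  SalePrice = c(208500, 181500, 223500),
  GrLivArea = c(1710, 1262, 1786),
  Street    = c("Pave", "Pave", "Grvl"),
  stringsAsFactors = FALSE
)

# One TRUE/FALSE per column, named after the columns
is_num <- sapply(toy, is.numeric)
print(is_num)

# Use the logical vector to keep only the numeric columns
toy_numeric <- toy[, is_num, drop = FALSE]
print(names(toy_numeric))
```

The same indexing is what happens inside the `cor()` call above: only the columns flagged TRUE are passed on.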

You should now have a matrix of correlation values for each column of the dataframe. At this stage we are only interested in the correlation of each variable with the dependent variable, which in this case is the very last column *SalePrice*.

The five highest correlation values (remember that you should be looking for the highest *absolute* values, so include negative values as well) are for the following variables:

**OverallQual**: the overall material and finish of the house (correlation = 0.79)

**GrLivArea**: the above-ground living area in square feet (correlation = 0.71)

**GarageCars**: the size of the garage in car capacity (correlation = 0.64)

**GarageArea**: the size of the garage in square feet (correlation = 0.62)

**TotalBsmtSF**: the total basement area in square feet (correlation = 0.61)
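Rather than reading the ranking off the full matrix by eye, it can be sorted programmatically. The sketch below is one possible way to do it: `top_correlations` is a hypothetical helper (not part of base R), and the data is synthetic so that the block runs without the Kaggle files.

```r
# Rank numeric columns by absolute correlation with a target column.
# `top_correlations` is a hypothetical helper written for this sketch.
top_correlations <- function(df, target, n = 5) {
  num <- df[, sapply(df, is.numeric), drop = FALSE]
  # pairwise.complete.obs avoids NA results when some columns have missing values
  cors <- cor(num, use = "pairwise.complete.obs")[, target]
  cors <- cors[names(cors) != target]
  cors[order(abs(cors), decreasing = TRUE)][seq_len(min(n, length(cors)))]
}

# Synthetic data for illustration only
set.seed(1)
x1 <- rnorm(100)
x2 <- rnorm(100)
toy <- data.frame(y = 3 * x1 - 0.5 * x2 + rnorm(100, sd = 0.1),
                  x1 = x1, x2 = x2, label = "a")

top <- top_correlations(toy, "y", n = 2)
print(top)
```

With the real data the call would be `top_correlations(train, "SalePrice")`. Note that the `Id` column is numeric too, so it takes part in the ranking unless you drop it first.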

One of the assumptions of linear regression is that the independent variables must not be highly correlated with one another (the fancy word is multicollinearity). From the correlation matrix we ran earlier, we can see that GarageCars and GarageArea are strongly correlated, with a value of 0.88, so one of these two variables can probably be removed. In addition, GrLivArea and TotalBsmtSF are moderately correlated at 0.45, and it might be reasonable to remove one of these variables as well.

For the purposes of this tutorial, we are going to remove GarageArea and keep GarageCars, given the latter had a slightly stronger correlation with SalePrice. But we’ll keep both GrLivArea and TotalBsmtSF – we can always remove these later once we’ve applied linear regression.
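A quick way to spot candidate pairs like these is to list every pair of predictors whose absolute correlation exceeds some cut-off. This is a sketch only: `high_cor_pairs` is a hypothetical helper, the 0.8 threshold is an arbitrary assumption, and the data is synthetic.

```r
# List pairs of predictors whose absolute correlation exceeds a threshold.
# `high_cor_pairs` and the 0.8 cut-off are illustrative assumptions.
high_cor_pairs <- function(df, vars, threshold = 0.8) {
  cm <- cor(df[, vars], use = "pairwise.complete.obs")
  # upper.tri() keeps each pair once and skips the diagonal
  idx <- which(abs(cm) > threshold & upper.tri(cm), arr.ind = TRUE)
  data.frame(
    var1 = rownames(cm)[idx[, "row"]],
    var2 = colnames(cm)[idx[, "col"]],
    correlation = cm[idx],
    stringsAsFactors = FALSE
  )
}

# Synthetic stand-in: garage area is built to track garage capacity closely
set.seed(42)
n_cars <- rpois(200, 2)
toy <- data.frame(
  GarageCars = n_cars,
  GarageArea = 250 * n_cars + rnorm(200, sd = 40),
  GrLivArea  = rnorm(200, 1500, 400)
)

flagged <- high_cor_pairs(toy, c("GarageCars", "GarageArea", "GrLivArea"))
print(flagged)
```

On the real data you would pass `train` and the five candidate variable names. Which member of a flagged pair to drop remains a judgment call, as described above.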

Helpfully, the method of applying machine learning regression algorithms in R usually follows a similar syntax, which is simply:

function(formula, data)

where *function* is the function for the algorithm you’ve chosen (in the case of linear regression, this is *lm*), *formula* is where you specify your dependent and independent variables separated by a ~ (i.e. *dependent.variable ~ independent.variable.1 + independent.variable.2* and so on), and *data* is where you specify the dataframe on which to train the algorithm.

To run a linear regression, type the following:

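    library(dplyr)

    # Keep only the dependent variable and the chosen independent variables
    train.sub <- train %>%
      select(SalePrice, OverallQual, GarageCars, GrLivArea, TotalBsmtSF)

    # Fit the linear regression using every remaining column as a predictor
    myModel <- lm(SalePrice ~ ., data = train.sub)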

**Tip!** If you put a full-stop after the ~ in the formula, this tells R to use all the variables (except the dependent variable already specified) in your dataframe as independent variables. Notice that we’ve used dplyr first to select only the columns that contain our dependent and independent variables.
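To convince yourself that the full-stop shorthand and an explicit formula give the same model, here is a small sketch using `mtcars`, a dataset that ships with R. It is only a stand-in for the house price data, so the variable names differ from the tutorial's.

```r
# mtcars ships with R and stands in for the house price data here
sub <- mtcars[, c("mpg", "wt", "hp")]

fit_dot      <- lm(mpg ~ ., data = sub)        # all other columns as predictors
fit_explicit <- lm(mpg ~ wt + hp, data = sub)  # same predictors, named explicitly

print(coef(fit_dot))
print(all.equal(coef(fit_dot), coef(fit_explicit)))
```

The shorthand is convenient, but it silently picks up any extra column left in the dataframe, which is why the columns are selected first.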

Type *summary(myModel)* into the console and you should get the following:

    > summary(myModel)

    Call:
    lm(formula = SalePrice ~ ., data = train.sub)

    Residuals:
        Min      1Q  Median      3Q     Max
    -469856  -19956   -1360   17200  286304

    Coefficients:
                  Estimate Std. Error t value Pr(>|t|)
    (Intercept) -99248.853   4639.866  -21.39   <2e-16 ***
    OverallQual  23572.236   1072.465   21.98   <2e-16 ***
    GarageCars   18582.209   1747.412   10.63   <2e-16 ***
    GrLivArea       45.643      2.484   18.38   <2e-16 ***
    TotalBsmtSF     32.520      2.838   11.46   <2e-16 ***
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Residual standard error: 38920 on 1455 degrees of freedom
    Multiple R-squared:  0.7607,  Adjusted R-squared:  0.76
    F-statistic:  1156 on 4 and 1455 DF,  p-value: < 2.2e-16

Some of the key statistics that are helpful in interpreting a linear regression are as follows:

The R-squared value, or R^{2}, is a measure of goodness-of-fit. It represents the percentage of the variance of the dependent variable (in this case the SalePrice) that is explained collectively by the independent variables. It is therefore a common measure of accuracy of a regression model.

It is common to use the Adjusted R^{2} value because this takes into account the number of independent variables used in the regression and adjusts the R^{2} such that it increases only if any new independent variables improve the model more than would be expected by chance. Using the Adjusted R^{2} value therefore reduces the risk of getting incorrectly inflated values purely based on having a large number of independent variables.

Our model has an Adjusted R^{2} value of 0.76, meaning that 76% of the variation in SalePrice can be explained by the collective effect of OverallQual, GarageCars, GrLivArea and TotalBsmtSF. Not bad for a first effort.
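Both statistics can be reproduced by hand, which helps to see what the adjustment does. The sketch below uses the built-in `mtcars` data as a stand-in (an assumption made so the block runs on its own) and the standard formulas R^{2} = 1 - SS_res / SS_tot and Adjusted R^{2} = 1 - (1 - R^{2})(n - 1) / (n - p - 1), where n is the number of observations and p the number of independent variables.

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)  # built-in data, for illustration

y      <- mtcars$mpg
ss_res <- sum(residuals(fit)^2)          # variation left unexplained by the model
ss_tot <- sum((y - mean(y))^2)           # total variation around the mean
n      <- length(y)                      # number of observations
p      <- length(coef(fit)) - 1          # number of independent variables

r2     <- 1 - ss_res / ss_tot
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

print(c(r2 = r2, adj_r2 = adj_r2))
```

Plugging the figures from our house price summary into the same formula (R^{2} = 0.7607, n = 1460, p = 4) gives roughly 0.760, which matches the reported Adjusted R^{2}.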

Another useful set of statistics is the residuals, as these provide an indication of how far out your predictive model is from the real-world values. A residual is the difference between the actual value of your dependent variable and the predicted value (actual minus predicted). In our case, we have a median residual of -1,360, which means that for a typical house our model slightly overshot the SalePrice. That doesn’t seem too bad, however there are clearly some outliers as the largest negative residual is -470k and the largest positive is 286k, indicating that our model over- and undershot some values quite dramatically.
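The sign convention is easy to get backwards, so here is a short sketch that rebuilds the residuals from scratch. It uses the built-in `mtcars` data as a stand-in for the house prices, which is an assumption made purely so the example is self-contained.

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)

actual    <- mtcars$mpg
predicted <- fitted(fit)
res       <- actual - predicted   # same values as residuals(fit)

print(summary(res))               # min, quartiles, median and max, as in summary(fit)

# Negative residual: prediction above the actual value (overshoot)
# Positive residual: prediction below the actual value (undershoot)
share_overshot <- mean(res < 0)
print(share_overshot)
```

Because R defines residuals as actual minus fitted, a large negative residual flags an observation the model overpredicted.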

A linear regression model can be used to predict the dependent variable from the independent variables, and the intercepts and coefficients can help us determine how this is done. The regression can be expressed as a formula as follows:

y = β_{0} + β_{1}x_{1} + β_{2}x_{2} …

where *y* is the dependent variable, *β_{0}* is the intercept, *x_{1}*, *x_{2}*, … are the independent variables and *β_{1}*, *β_{2}*, … are their coefficients.

Thus, our linear regression can be expressed as:

*SalePrice = -99248.853 + 23572.236(OverallQual) + 18582.209(GarageCars) + 45.643(GrLivArea) + 32.520(TotalBsmtSF)*

Using this formula, we can predict the value of SalePrice given any new values of independent variables.
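As a worked example, the formula can be evaluated directly. The coefficients below are copied from the summary output above; the house itself is hypothetical, with values invented for illustration.

```r
# Coefficients copied from the summary output above
coefs <- c(
  "(Intercept)" = -99248.853,
  OverallQual   = 23572.236,
  GarageCars    = 18582.209,
  GrLivArea     = 45.643,
  TotalBsmtSF   = 32.520
)

# A hypothetical house (values invented for illustration)
house <- c(OverallQual = 7, GarageCars = 2, GrLivArea = 1500, TotalBsmtSF = 1000)

# Intercept plus each coefficient multiplied by the matching variable
predicted_price <- coefs[["(Intercept)"]] + sum(coefs[names(house)] * house)
print(predicted_price)
```

This comes to about 203,906. In practice you would let R do the arithmetic by passing a dataframe with the same column names to `predict()` along with the fitted model.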

However, before using this regression formula, it is also important to look at the other statistics available alongside the coefficients. You will see that we are also given the Standard Error, a T value and a “Pr” value.

The **Standard Error** of a coefficient is an estimate of the standard deviation of that coefficient estimate, in other words how much the estimate would be expected to vary from sample to sample.

The **T value** is the coefficient divided by its standard error. Thus in general, the larger the absolute t value the better, because a large value indicates that the size of the error is small in comparison to the coefficient itself.

The “Pr” value, or simply the **P value**, tests the null hypothesis that there is no relationship between the independent variable and the dependent variable (that is, that the true coefficient is zero). It is common to take any value below 0.05 as grounds to reject the null hypothesis, that is to say, the coefficient *is* statistically significant.
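These three columns are linked by simple arithmetic, which the following sketch reproduces. It again uses the built-in `mtcars` data as a stand-in, and assumes the usual two-sided t test with the model's residual degrees of freedom.

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)
tab <- coef(summary(fit))   # matrix: Estimate, Std. Error, t value, Pr(>|t|)

# t value: coefficient divided by its standard error
t_manual <- tab[, "Estimate"] / tab[, "Std. Error"]

# Two-sided p value from the t distribution with residual degrees of freedom
p_manual <- 2 * pt(-abs(t_manual), df = df.residual(fit))

print(cbind(t_manual, p_manual))
```

The same check works on our house price output: for OverallQual, 23572.236 divided by 1072.465 gives approximately 21.98, the t value shown in the table.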

You may remember that we had a train.csv and test.csv dataset. We can now use our linear regression model to predict the SalePrice values of the test.csv dataset and submit it to Kaggle for scoring:

    # Keep the Id plus the four independent variables used by the model
    test.sub <- test %>%
      select(Id, OverallQual, GarageCars, GrLivArea, TotalBsmtSF)

    # Impute the missing values with the column mean
    test.sub$GarageCars[is.na(test.sub$GarageCars)] <- mean(test.sub$GarageCars, na.rm = TRUE)
    test.sub$TotalBsmtSF[is.na(test.sub$TotalBsmtSF)] <- mean(test.sub$TotalBsmtSF, na.rm = TRUE)

    # Predict SalePrice and keep only the Id and prediction columns for submission
    predictions <- predict(myModel, test.sub)
    test.results <- cbind(test.sub, predictions)
    test.results <- test.results[, c(1, 6)]
    colnames(test.results)[2] <- "SalePrice"
    write.csv(test.results, "submission.csv", row.names = FALSE)

You will notice that we have had to impute some missing values for GarageCars and TotalBsmtSF as there was one missing value in each of these variables in test.csv. For speed we have simply replaced these missing values with the mean of the corresponding column in the test data, however there are better approaches to imputing.
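One simple alternative, sketched below, is to fill the gaps with the median taken from the training data, so that the fill value is less sensitive to outliers and does not depend on the test set. This is just one option among many: `impute_median` is a hypothetical helper, and the two small dataframes are invented stand-ins for train and test.

```r
# Fill NAs in `target` using medians computed on `reference` (e.g. the training set).
# `impute_median` is a hypothetical helper written for this sketch.
impute_median <- function(target, reference, cols) {
  for (col in cols) {
    fill <- median(reference[[col]], na.rm = TRUE)
    target[[col]][is.na(target[[col]])] <- fill
  }
  target
}

# Toy stand-ins for train and test (values invented)
toy_train <- data.frame(GarageCars = c(1, 2, 2, 3),
                        TotalBsmtSF = c(800, 900, 1000, 1200))
toy_test  <- data.frame(GarageCars = c(2, NA),
                        TotalBsmtSF = c(NA, 950))

toy_test <- impute_median(toy_test, toy_train, c("GarageCars", "TotalBsmtSF"))
print(toy_test)
```

A side benefit for GarageCars is that the median of a whole-number column is usually a whole number, whereas the mean rarely is.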

Exporting our predictions to CSV and uploading them to Kaggle yields a score of 0.62 and a lowly 5000^{th} place on the leaderboard. However, as a very quick 2-minute effort there’s lots of room for improvement.

To improve the score, here are some ideas of what to look at:

1. Investigate the **correlation** of other variables with a view to adding more independent variables.

2. Convert categorical or character variables into numeric variables using one-hot encoding or other methods.

3. Standardise or normalise variables.

4. Feature engineering.
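To give a flavour of ideas 2 and 3, here is a minimal base R sketch of one-hot encoding and standardisation. The dataframe `toy` and its values are invented for illustration; on the real data you would apply the same calls to columns of `train`.

```r
# Toy data: one numeric and one categorical column (values invented)
toy <- data.frame(
  GrLivArea = c(1710, 1262, 1786, 1717),
  Street    = factor(c("Pave", "Pave", "Grvl", "Pave"))
)

# One-hot encoding: "- 1" drops the intercept so each level of this
# single factor gets its own 0/1 column
one_hot <- model.matrix(~ Street - 1, data = toy)

# Standardisation: centre to mean 0 and scale to standard deviation 1
toy$GrLivArea_z <- as.numeric(scale(toy$GrLivArea))

print(cbind(one_hot, GrLivArea_z = toy$GrLivArea_z))
```

Note that `lm` already encodes factor columns internally, so explicit one-hot encoding matters mostly when you want to inspect or select individual levels, or when you move on to algorithms that require a numeric matrix.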

I hope you’ve enjoyed reading this tutorial. Please leave any comments or queries in the Comments section below and I’ll do my best to answer them. Thanks for reading.

Tags: dplyr, linear regression, machine learning, normalisation, one-hot encoding, R, standardisation


## One thought on “How to apply and interpret linear regression in R”