The Statistics Professor

How do I train a neural network? How do I use an LSTM for time series data? When we think of machine learning, or search Google for it, we are typically bombarded with answers to these sorts of questions.

However, as in many areas of mathematics and statistics, having a good foundation is key. Hence it is fitting that for my first blog post I discuss an elementary yet powerful approach for finding relationships in data, namely regression analysis.

Suppose we are interested in predicting house prices. We can think of a number of factors that would affect the price of a house, e.g. its square footage, number of rooms, size of garden, and distance from hospitals, shops, etc.

We can express this mathematically as follows:

$Y_{i} = f(X_{1,i},\ldots,X_{k,i})$

Here $Y_{i}$ denotes the price of the $i$-th house and $X_{1,i},\ldots,X_{k,i}$ are $k$ different variables that affect its price. For example, $X_{1,i}$ may be the square footage of the $i$-th house and $X_{k,i}$ may be the $i$-th house’s distance to the nearest hospital.

We can think of the function $f$ as the unobserved process that describes the relationship between the independent variables (what we use to predict) and the dependent variable (what we want to predict). Of course, if we knew the form of $f$, building a machine learning algorithm for prediction would be pointless. It is important to note that $f$ need not be a linear combination of the variables. When we use linear regression, we merely approximate this relationship with a linear function and hope the approximation is sufficiently good.
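To make the idea of approximating an unknown $f$ concrete, here is a minimal sketch in Python using NumPy. Everything in it is an assumption made for illustration: the two predictors (`area` and `dist`), their ranges, and especially the made-up nonlinear function `f`, which in practice we would never get to see. The sketch fits a linear function by ordinary least squares and reports how much of the variation in the simulated prices it explains.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200  # number of simulated houses

# Hypothetical predictors: floor area (square feet) and
# distance to the nearest hospital (miles). Ranges are invented.
area = rng.uniform(500, 3000, size=n)
dist = rng.uniform(0.5, 20, size=n)


def f(area, dist):
    """Made-up nonlinear 'true' process; unknown in a real problem."""
    return (50_000 + 120 * area
            + 40_000 * np.log(area / 500)
            - 15_000 * np.sqrt(dist))


y = f(area, dist)  # simulated prices (no noise added, to isolate approximation error)

# Design matrix: a column of ones for the intercept, then the predictors.
X = np.column_stack([np.ones(n), area, dist])

# Ordinary least squares: choose beta to minimise ||y - X @ beta||^2.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# Fitted values and the share of variance the linear fit explains.
y_hat = X @ beta
r2 = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

print("estimated coefficients:", beta)
print("R^2 of the linear approximation:", r2)
```

Because the simulated `f` is not linear, the fit cannot be perfect, so the $R^2$ value stays strictly below 1. How close it gets depends entirely on how nonlinear the true process is over the range of the data, which is exactly the "hope it is sufficiently good" caveat.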