We can describe the bias-variance tradeoff through the following equation:
\[
Error = \sigma^2 + Bias^2(\hat{f}(x)) + Var(\hat{f}(x))
\]
where:
\[
\begin{align}
Bias^2(\hat{f}(x)) &= \big( E[\hat{f}(x)] - f(x) \big)^2 \\
Var(\hat{f}(x)) &= E\Big[ \big( \hat{f}(x) - E[\hat{f}(x)] \big)^2 \Big] \\
\sigma^2 &= E[(y - f(x))^2]
\end{align}
\]
(\(\sigma^2\) is the variance of the unobserved error, i.e., the irreducible noise.) High-bias models tend to underfit; high-variance models tend to overfit. Lowering the bias typically increases the variance and vice versa, hence the tradeoff.
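To make the decomposition concrete, here is a minimal Monte Carlo sketch that estimates each term at a single test point. The true function, the noise level, and the deliberately simple linear fit are illustrative assumptions, not prescribed by the text above.

```python
# Minimal sketch: Monte Carlo check of Error = sigma^2 + Bias^2 + Var at one
# test point x0. The true function f, noise std, and the high-bias linear fit
# are assumed for illustration only.
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(2 * np.pi * x)   # assumed true function
sigma = 0.3                            # assumed noise standard deviation
x0, n, trials = 0.6, 30, 2000

preds = np.empty(trials)
for t in range(trials):
    x = rng.uniform(0, 1, n)
    y = f(x) + rng.normal(0, sigma, n)
    coeffs = np.polyfit(x, y, deg=1)   # deliberately simple (high-bias) model
    preds[t] = np.polyval(coeffs, x0)

bias_sq = (preds.mean() - f(x0)) ** 2
var = preds.var()
# expected squared error against a fresh noisy observation y0 at x0
y0 = f(x0) + rng.normal(0, sigma, trials)
mse = ((y0 - preds) ** 2).mean()
print(f"MSE ~ {mse:.4f}  vs  sigma^2 + bias^2 + var = {sigma**2 + bias_sq + var:.4f}")
```

With enough trials the two printed numbers agree up to Monte Carlo error, which is exactly the decomposition above.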
Note that we can choose a biased estimator as long as it reduces the variance by more than the squared bias it introduces (and vice versa)! For instance, L2 regularization (ridge regression) is an example of a biased estimator that, when tuned correctly, reduces overall error.
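As a sketch of that idea, the snippet below compares ordinary least squares (\(\lambda = 0\)) with closed-form ridge regression over repeated training samples; the data-generating process and the value of \(\lambda\) are assumptions chosen only to make the effect visible.

```python
# Minimal sketch: ridge (L2) regression trades a small bias for a larger drop
# in variance. The data-generating process and lambda are illustrative.
import numpy as np

rng = np.random.default_rng(1)
p, n, trials, lam = 20, 30, 1000, 10.0
w_true = rng.normal(0, 1, p)           # assumed true weights

def fit(X, y, lam):
    # closed-form ridge estimate: (X^T X + lam I)^{-1} X^T y; lam = 0 gives OLS
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

x0 = rng.normal(0, 1, p)               # fixed test point
preds = {0.0: [], lam: []}
for _ in range(trials):
    X = rng.normal(0, 1, (n, p))
    y = X @ w_true + rng.normal(0, 1.0, n)
    for l in preds:
        preds[l].append(fit(X, y, l) @ x0)

for l, ps in preds.items():
    ps = np.array(ps)
    b2, v = (ps.mean() - w_true @ x0) ** 2, ps.var()
    print(f"lambda={l:>4}:  bias^2={b2:.3f}  var={v:.3f}  sum={b2 + v:.3f}")
```

With these settings the ridge fit typically shows a small nonzero bias but a much smaller variance, so its bias² + variance total comes out below the OLS value.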
Bias-variance tradeoff for classification: If we use 0-1 loss instead of MSE, the additive bias-variance decomposition no longer holds. In that case, bias and variance combine multiplicatively (see The Elements of Statistical Learning, exercise 7.2). If the estimate is on the correct side of the decision boundary, the bias is negative, and decreasing the variance decreases the misclassification rate. But if the estimate is on the wrong side of the decision boundary, the bias is positive, and it actually pays to increase the variance.
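The following toy simulation illustrates that last point under an assumed setup: the true class-1 probability, the estimator's mean, and its spread are all made up for illustration. When the mean estimate sits on the wrong side of the 0.5 boundary, increasing the spread lowers the 0-1 error.

```python
# Minimal sketch: with 0-1 loss, an estimate whose mean is on the WRONG side
# of the decision boundary benefits from more variance. All numbers (true
# class probability 0.8, estimator mean 0.4, boundary 0.5) are illustrative.
import numpy as np

rng = np.random.default_rng(2)
p_true, mean_hat, trials = 0.8, 0.4, 200_000   # Bayes decision: class 1

for std in (0.01, 0.1, 0.3):
    p_hat = rng.normal(mean_hat, std, trials)   # noisy probability estimate
    predict_one = p_hat >= 0.5
    # predicting 0 is wrong with prob. p_true, predicting 1 with prob. 1 - p_true
    err = np.where(predict_one, 1 - p_true, p_true).mean()
    print(f"std={std:>4}: misclassification rate = {err:.3f}")
```

As the spread grows, more estimates cross to the correct side of the boundary and the misclassification rate falls, in line with the exercise cited above.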