Determine the threshold "p" in the logistic regression
Coefficients finding, Confusion Matrix, Decision boundary
Used packages in R
In this post, we will use the following R packages:
library(ggplot2)
library(dplyr)
library(magrittr)
Data generation
Let us assume that we have a dataset related to the University of Iowa’s admission data. There are \(500\) students who applied to the University of Iowa, and some of them obtained an admission from U of Iowa, some of them do not. The data consists of two independent variables about the students; \(x_1\), the average study time
and \(x_2\) the average number of questions
made in the class, and the dependent variable \(y\) is the whether the student obtained an admission from U of Iowa or not.
R code for generating data
set.seed(2020)
x1 <- c(rnorm(250, 5, 2), rnorm(250, 10, 2))
x2 <- c(rnorm(250, 3, 2), rnorm(250, 6, 2))
y <- rep(c(0, 1), c(250, 250))
mydataset <- data.frame(studytime = x1, nquestion = x2, admission = y)
mydataset
studytime | nquestion | admission |
---|---|---|
5.7539442 | 5.413575 | 0 |
5.6030967 | 2.955804 | 0 |
2.8039537 | 5.381202 | 0 |
2.7391882 | 3.213744 | 0 |
-0.5930686 | 6.082115 | 0 |
6.4411470 | 1.243789 | 0 |
Data split
Let us split our data into two groups: the train set and the test set. The training set will take 75% of the data samples chosen randomly, and the test set takes the rest of the data set.
set.seed(2020)
mydataset$id <- 1:nrow(mydataset)
train <- mydataset %>% sample_frac(.75)
test <- anti_join(mydataset, train, by = 'id')
dim(train)
[1] 375 4
dim(test)
[1] 125 4
Data visualization
The scatter plot visualizes the train set. The \(x\) and \(y\) axes represent the average study time
and number of the question
, respectively. Note that the color-coding distinguishes the values of the dependent variable; red for \(0\), green for \(1\). Based on the plot, we can guess that the students who have longer study time and the more number of questions made tend to receive the admission.
R code
p <- ggplot(data = train) +
geom_point(mapping =
aes(x = studytime,
y = nquestion,
color = factor(admission))) +
xlab("Study time") +
ylab("Number of Question") +
scale_colour_discrete(guide = guide_legend(reverse=TRUE),
name="Exam",
labels=c("Fail", "Success")) +
theme_bw()
p
Logistic regression model
The logistic regression model is one of the Generalized linear models (GLMs), which can be thought of as an extension of linear regression. The pair of assumed distribution and the link function \(g\) identifies the GLMs. Note that the link function can be thought as of a bridge between the linear predictor and the mean of the distribution function as follows: \[ \mathbb{E}[Y|X]=X\beta = \mu = g^{-1}(X\beta) \]
The logistic regression assumes the dependent variable follows Bernoulli distribution with logit link \(g\), which can be written as follows; \[ g(x) = log(\frac{x}{1-x}) \] For the convenience of notation, we will denote the inverse of \(g\) by \(\sigma\) function as follows: \[ \sigma(x):=g^{-1}(x) = \frac{e^x}{1+e^x} \]
Since our dependent variable \(Y\) has only two options: 1 (success) and 0 (fail), we can assume that the observations \(y_i\)s, where \(i=1,...,n\), came from the random variable \(Y\) follows Bernoulli random variable. We know that the probability of success, \(p\), is that parameter of the Bernoulli distribution, which is the mean of the distribution.
In the logistic regression, we link the mean \(p\) with the linear predictor using the logit link function. Thus, we can say that that the logistic regression models the dependent variable \(Y\) as follows:
\[ Y \sim Bernulli(\sigma(X\beta)) = Bernulli(\frac{1}{1+e^{-X\beta}}) \]
Estimate the coefficients in R
Calculating the coefficients of logistic regression in R
is quite simple because of the glm
function.
model <- glm(admission ~ studytime + nquestion,
data = train,
family = binomial(link = "logit"))
model$coefficients
(Intercept) studytime nquestion
-13.8252435 1.2792152 0.9604621
In this article, we are going to dive into how to calculate these coefficients numerically.
Likelihood function of the observations
Note that the probability mass function of the Bernoulli r.v. is as follows:
\[ \tag{1} f_Y(y; p) = p^{y}(1-p)^{1-y}, \text{ for }y = 1, 0, \text{ and } 0 \le p \le 1. \]
Then, the likelihood function can be seen as a function of \(\beta\),
\[
\begin{align}
p\left(\beta|\mathbf{X}, \underline{y}\right) & =\prod_{i=1}^{n}p\left(\beta|\mathbf{x}_{i},y_{i}\right)\\
& =\prod_{i=1}^{n}\sigma\left(\mathbf{x}_{i}^{T}\beta\right)^{y_{i}}\left(1-\sigma\left(\mathbf{x}_{i}^{T}\beta\right)\right)^{1-y_{i}}\\
& =\prod_{i=1}^{n}\pi_{i}^{y_{i}}\left(1-\pi_{i}\right)^{1-y_{i}}
\end{align}
\]
In the last step, I used the notation \(\pi_{i}:=\sigma\left(\mathbf{x}_{i}^{T}\beta\right)\) for the convenience of writing. To find the MLE, we need to find a \(\beta\), which maximizes the likelihood function (or log-likelihood function) above. To use the optimation function in R, we will take a -log
function and find an arg min of it.
\[ \begin{align*} \ell\left(\beta\right) & =-log\left(p\left(\underline{y}|\mathbf{X},\beta\right)\right)\\ & =-\sum_{i=1}^{n}\left\{ y_{i}log\left(\pi_{i}\right)+\left(1-y_{i}\right)log\left(1-\pi_{i}\right)\right\} \\ & =-\sum_{i=1}^{n}\left\{ y_{i}log\left(\sigma\left(\mathbf{x}_{i}^{T}\beta\right)\right)+\left(1-y_{i}\right)log\left(1-\sigma\left(\mathbf{x}_{i}^{T}\beta\right)\right)\right\} \end{align*} \]
Thus, we can say that the MLE of \(\beta\), \(\hat{\beta}\) is
\[
\hat{\beta} \overset{set}{=} \underset{\beta}{arg \ min} \ \ \ell\left(\beta\right)
\]
However, the function \(\ell\) does not have a closed-form for the arg min, so we need to use the numerical method. In this article, we will use the optim
function in R to find it.
Define functions
Let us start by defining the sigma function, \(\sigma(\cdot)\), and the log-likelihood function \(\ell(\cdot)\) as follows:
sigma_fcn <- function(x){
1 / (1 + exp(-x))
}
ll_fcn <- function(beta, dataX, dataY){
pi_vec <- sigma_fcn(dataX %*% beta)
-sum(dataY * log(pi_vec) + (1 - dataY) * log(1 - pi_vec))
}
model.matrix
function is useful because it will put the intercept column in the data matrix automatically.
matX <- model.matrix(admission ~ studytime + nquestion, data = train)
y <- train$admission
head(matX)
(Intercept) studytime nquestion
1 1 10.584725 5.413679
2 1 6.595139 1.095518
3 1 1.221887 3.824581
4 1 7.196365 4.029620
5 1 5.211391 6.194372
6 1 10.259252 8.519582
By using the optim
, we can find the same coefficient that we had above.
result <- optim(par = c(1, 1, 1), fn = ll_fcn, dataX = matX, dataY = y)
# convergence check: 0 means success.
result$convergence
[1] 0
# The best set of parameters found.
result$par
[1] -13.8216738 1.2789949 0.9600578
Predictions using train data
After we fit the model, the prediction of the probability of getting admission from the U of Iowa can be made using predict
function. However, it this section, we will still use train data to do the fine-tuning for finding the best \(p\).
predicted_prob <- predict(model, train, type = "response")
If we use \(p\) = 0.5 as a threshold for the decision, then we can make the following prediction evaluation table; this table called confusion matrix
in the machine learning field.
prediction <- as.integer(predicted_prob > 0.5)
confusion_mat <- addmargins(table(train$admission, prediction))
# Labeling
names(dimnames(confusion_mat)) <- c("True status", "Prediction")
colnames(confusion_mat) <- c("Fail", "Success", "Total")
rownames(confusion_mat) <- c("Fail", "Success", "Total")
confusion_mat
Confusion matrix
Table 2 shows the confusion matrix, which corresponds to the predictions with threshold probability 0.5. The interpretation of the table is straight forward; if we use the probability 0.5 as the threshold of the prediction, there are
Fail | Success | Total | |
---|---|---|---|
True status | |||
Fail | 167 | 14 | 181 |
Success | 13 | 181 | 194 |
Total | 180 | 195 | 375 |
167 True Negatives: Prediction as a fail was correct.
14 False Positives: Preddiction as a success was wrong.
13 False Negatives: Preddiction as a success was wrong.
181 True Positives: Right predictions as a success was correct.
Some readers might be confused, which one refers to which. Note that the first part of the term is related to the result of the prediction, and the second part of the term indicates the prediction label. Thus, True positives means the prediction made as Positive status (Non-null status), and the prediction was correct.
Using the confusion matrix, we can evaluate the classification. The most two important measures are as follows:
- False Positive rate: 1 - Specificity (also called Type I error in stat.)
\[ \frac{FP}{TN + FP} = \frac{14}{167 + 14} = 0.0773 \]
- True Positive rate: Sensitivity (also called Power in statistics)
\[ \frac{TP}{FN + TP} = \frac{181}{(13 + 181)} = 0.933 \]
- Accuracy:
\[ \frac{TP + TN}{N} = \frac{167 + 181}{375} = 0.928 \]
Thus, for each value of \(p\), we will get the above performance measures. All the pairs of FP rates and TP rates can be obtained for every \(p \in (0, 1)\). Figure 2, called the ROC curve, which is the scatter plot of the pairs of FP rates and TP rates. The name of the curve, ROC, stands for “receiver operating characteristics,” which originated from communications theory.
R code for Figure 2
library(ROCR)
prediction(predicted_prob, train$admission) %>%
performance(measure = "tpr", x.measure = "fpr") -> result
plotdata <- data.frame(x = result@x.values[[1]],
y = result@y.values[[1]],
p = result@alpha.values[[1]])
p <- ggplot(data = plotdata) +
geom_path(aes(x = x, y = y)) +
xlab(result@x.name) +
ylab(result@y.name) +
theme_bw()
dist_vec <- plotdata$x^2 + (1 - plotdata$y)^2
opt_pos <- which.min(dist_vec)
p +
geom_point(data = plotdata[opt_pos, ],
aes(x = x, y = y), col = "red") +
annotate("text",
x = plotdata[opt_pos, ]$x + 0.1,
y = plotdata[opt_pos, ]$y,
label = paste("p =", round(plotdata[opt_pos, ]$p, 3)))
Which \(p\) should we use?
Selecting the best \(p\) is somewhat subjective since it depends on the prediction context.
One of the typical methods of choosing \(p\) is to find the coordinates of FP and TP in the ROC curve, which minimized the distance from the point to (0, 1). In other words, find the \(p\), which makes the decision rule to be a perfect classifier. In Figure 2, the black dot represents the optimal point with threshold \(p\) = 0.468.
However, let us assume that we are dealing with data that the success in the data indicates having a specific disease such as COVID 19. Then, as a model evaluator, we do not want to make False Negative predictions since they have more impact on society than the False Positive predictions.
Imagine that the tester says a person does not have COVID 19, but actually, the person has it.
The same philosophy can be applied to our data set. It is better to have a situation that a student gets admission, yet our prediction says Fail.
As we can see that the power and the False positive rate are trade-off relationship; if we say all the tested people have COVID 19, then the False Negatives will be zero, which is power 1. However, this is just a useless classifier. Thus, we need some criteria or rationale for choosing the best \(p\). One of the possible methods is to control the acceptable level of False Negative rate in advance and find the \(p\) value by minimizing the False Positive rate (Type I error).
Fail | Success | Total | |
---|---|---|---|
True status | |||
Fail | 166 | 15 | 181 |
Success | 12 | 182 | 194 |
Total | 178 | 197 | 375 |
Let us assume that we set our FN rate as 0.05 and try to find the \(p\) value, which makes the lowest FP rate on the training data set. The horizontal line in Figure 3 shows the level of FN rate of 0.02, or we can say 98 % of test power. Therefore, green points above the dashed line represent the possible values of \(p\), satisfying our FN rate criteria. Note that the red dot in the figure indicates FP and TP rates corresponding to the probability value \(p = 0.219\), which has the smallest type II error yet satisfies the FP rate criteria. Thus, we can say that \(p = 0.219\) is the best \(p\) we can choose among these possible options.
R code for Figure 3
min_pos <- which.max(plotdata$y >= 0.98)
p2 +
geom_point(data = plotdata[plotdata$y >= 0.98, ],
aes(x = x, y = y), size = 0.2, col = "green") +
geom_hline(yintercept = 0.98, linetype = "dashed") +
geom_point(data = plotdata[min_pos, ], aes(x = x, y = y), col = "red") +
annotate("text",
x = plotdata[min_pos, ]$x + 0.07,
y = plotdata[min_pos, ]$y - 0.02,
label = paste("p =", round(plotdata[min_pos, ]$p, 3)),
col = "red") +
geom_point(data = plotdata[opt_pos, ],
aes(x = x, y = y), col = "black") +
annotate("text",
x = plotdata[opt_pos, ]$x + 0.05,
y = plotdata[opt_pos, ]$y - 0.05,
label = paste("p =", round(plotdata[opt_pos, ]$p, 3)))
Decision Boundary on the test data
Since we found our \(p\) value, we can visualize it using a decision boundary. Figure 4 shows the logit function, and the horizontal red dashed line represents the threshold \(p\) value that we have found in the previous section. Note that the inverse logit function gives us the \(x\) value -1.27 when the input value of \(y\) is equal to 0.219. Thus, using the \(p\) as a decision threshold means if the value of linear predictor \(X\beta\) is larger than -1.27, we will label it as \(1\) (or Success) in our prediction.
Note that our model has the two variables; study time and number of questions. Using the decision rule above, we can draw the following boundary in our data set:
\[ \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 = -1.27 \]
** Estimated \(\beta\) **
model$coefficients
(Intercept) studytime nquestion
-13.8252435 1.2792152 0.9604621
Evaluation
Now, we are going to use the test data
to evaluate the model.
R code for Figure 5
addline <- function(a, b, c, ...){
my_slope <- -a / b
my_intercept <- -c / b
geom_abline(slope = my_slope,
intercept = my_intercept, ...)
}
p_data <- ggplot(data = test) +
geom_point(mapping =
aes(x = studytime,
y = nquestion,
color = factor(admission))) +
xlab("Study time") +
ylab("Number of Question") +
scale_colour_discrete(guide = guide_legend(reverse=TRUE),
name="Exam",
labels=c("Fail", "Success")) +
theme_bw()
p_data +
addline(model$coefficients[2],
model$coefficients[3],
model$coefficients[1] + 1.27, linetype = "solid", col = "blue") +
addline(model$coefficients[2],
model$coefficients[3],
model$coefficients[1] + 0.13, linetype = "dashed")
In Figure 5, the solid blue line indicates the decision boundary induced by \(p = 0.219\) threshold. Compare to the black dashed line, which is the decision boundary corresponding \(p=0.468\); the solid blue line has a lower number of green dots on the left-hand side. The smaller number of green dots on the left-hand side means the smaller False Negatives.
Confusion matrix on test data
Fail | Success | Total | |
---|---|---|---|
True status | |||
Fail | 64 | 5 | 69 |
Success | 6 | 50 | 56 |
Total | 70 | 55 | 125 |
Fail | Success | Total | |
---|---|---|---|
True status | |||
Fail | 55 | 14 | 69 |
Success | 2 | 54 | 56 |
Total | 57 | 68 | 125 |
Table 4 and Table 5 show the confusion matrices on test data set with the threshold \(p\) of 0.468 and 0.219, respectively. By the definition of the FP rate and the TP rate, we have
False Positive rate:
with p = 0.468 \[ \frac{FP}{TN + FP} = \frac{5}{64 + 5} = 0.072 \] with p = 0.219 \[ \frac{FP}{TN + FP} = \frac{14}{55 + 14} = 0.203 \]
True Positive rate:
with p = 0.468 \[ \frac{TP}{FN + TP} = \frac{50}{(6 + 50)} = 0.893 \] with p = 0.219 \[ \frac{TP}{FN + TP} = \frac{54}{(2 + 54)} = 0.964 \]Accuracy:
with p = 0.468 \[ \frac{TP + TN}{N} = \frac{64 + 50}{125} = 0.912 \] with p = 0.219 \[ \frac{TP + TN}{N} = \frac{55 + 54}{125} = 0.872 \]
As we can see, the prediction with a threshold of 0.468 has a higher Accuracy
than the prediction with a threshold of 0.219. However, when it comes to TP rate
, the decision with a threshold of 0.219 has a much higher rate than the threshold of 0.468.