Determining the threshold \(p\) in logistic regression

Coefficient estimation, confusion matrix, decision boundary

R packages

In this post, we will use the following R packages:

library(ggplot2)
library(dplyr)
library(magrittr)

Data generation

Let us assume that we have a dataset of admission data for the University of Iowa. There are \(500\) students who applied to the University of Iowa; some of them obtained admission, and some did not. The data consists of two independent variables about the students, \(x_1\), the average study time, and \(x_2\), the average number of questions asked in class, and the dependent variable \(y\), which indicates whether the student obtained admission from U of Iowa or not.

R code for generating data

set.seed(2020)
x1 <- c(rnorm(250, 5, 2), rnorm(250, 10, 2))
x2 <- c(rnorm(250, 3, 2), rnorm(250, 6, 2))
y <- rep(c(0, 1), c(250, 250))
mydataset <- data.frame(studytime = x1, nquestion = x2, admission = y)
head(mydataset)
Table 1: The structure of the data set

   studytime nquestion admission
   5.7539442  5.413575         0
   5.6030967  2.955804         0
   2.8039537  5.381202         0
   2.7391882  3.213744         0
  -0.5930686  6.082115         0
   6.4411470  1.243789         0

Data split

Let us split our data into two groups: the train set and the test set. The train set will contain 75% of the samples, chosen randomly, and the test set the remaining 25%.

set.seed(2020)
mydataset$id <- 1:nrow(mydataset)
train <- mydataset %>% sample_frac(.75)
test  <- anti_join(mydataset, train, by = 'id')
dim(train)
[1] 375   4
dim(test)
[1] 125   4

Data visualization

The scatter plot visualizes the train set. The \(x\) and \(y\) axes represent the average study time and the number of questions, respectively. Note that the color coding distinguishes the values of the dependent variable: red for \(0\), green for \(1\). Based on the plot, we can guess that students who study longer and ask more questions tend to receive admission.

R code

p <- ggplot(data = train) +
  geom_point(mapping =
               aes(x = studytime,
                   y = nquestion,
               color = factor(admission))) +
  xlab("Study time") +
  ylab("Number of Question") +
  scale_colour_discrete(guide = guide_legend(reverse=TRUE),
                        name="Exam",
                        labels=c("Fail", "Success")) + 
  theme_bw()
p

Figure 1: The scatter plot of the train data

Logistic regression model

The logistic regression model is one of the generalized linear models (GLMs), which can be thought of as an extension of linear regression. A GLM is identified by the pair of an assumed distribution and a link function \(g\). Note that the link function can be thought of as a bridge between the linear predictor and the mean of the distribution as follows: \[ g(\mathbb{E}[Y|X]) = X\beta, \quad \text{i.e.,} \quad \mu = \mathbb{E}[Y|X] = g^{-1}(X\beta) \]

The logistic regression assumes the dependent variable follows a Bernoulli distribution with the logit link \(g\), which can be written as follows: \[ g(x) = \log\left(\frac{x}{1-x}\right) \] For notational convenience, we will denote the inverse of \(g\) by the function \(\sigma\): \[ \sigma(x):=g^{-1}(x) = \frac{e^x}{1+e^x} \]

Since our dependent variable \(Y\) has only two possible values, 1 (success) and 0 (fail), we can assume that the observations \(y_i\), \(i=1,\dots,n\), are realizations of a Bernoulli random variable \(Y\). The probability of success, \(p\), is the parameter of the Bernoulli distribution, and it is also the mean of the distribution.

In logistic regression, we link the mean \(p\) with the linear predictor using the logit link function. Thus, we can say that logistic regression models the dependent variable \(Y\) as follows:

\[ Y \sim \text{Bernoulli}(\sigma(X\beta)) = \text{Bernoulli}\left(\frac{1}{1+e^{-X\beta}}\right) \]

Estimate the coefficients in R

Calculating the coefficients of a logistic regression in R is quite simple thanks to the glm function.

model <- glm(admission ~ studytime + nquestion, 
             data = train,
             family = binomial(link = "logit"))
model$coefficients
(Intercept)   studytime   nquestion 
-13.8252435   1.2792152   0.9604621 

In this article, we are going to dive into how to calculate these coefficients numerically.

Likelihood function of the observations

Note that the probability mass function of the Bernoulli r.v. is as follows:

\[ \tag{1} f_Y(y; p) = p^{y}(1-p)^{1-y}, \text{ for }y = 1, 0, \text{ and } 0 \le p \le 1. \]

Then, the likelihood function can be seen as a function of \(\beta\):

\[ \begin{align} L\left(\beta\right) & =p\left(\underline{y}|\mathbf{X},\beta\right)=\prod_{i=1}^{n}p\left(y_{i}|\mathbf{x}_{i},\beta\right)\\ & =\prod_{i=1}^{n}\sigma\left(\mathbf{x}_{i}^{T}\beta\right)^{y_{i}}\left(1-\sigma\left(\mathbf{x}_{i}^{T}\beta\right)\right)^{1-y_{i}}\\ & =\prod_{i=1}^{n}\pi_{i}^{y_{i}}\left(1-\pi_{i}\right)^{1-y_{i}} \end{align} \] In the last step, I used the notation \(\pi_{i}:=\sigma\left(\mathbf{x}_{i}^{T}\beta\right)\) for convenience. To find the MLE, we need the \(\beta\) that maximizes the likelihood function (or log-likelihood function) above. To use the optimization function in R, we will take the negative log of the likelihood and find its arg min.

\[ \begin{align*} \ell\left(\beta\right) & =-\log\left(p\left(\underline{y}|\mathbf{X},\beta\right)\right)\\ & =-\sum_{i=1}^{n}\left\{ y_{i}\log\left(\pi_{i}\right)+\left(1-y_{i}\right)\log\left(1-\pi_{i}\right)\right\} \\ & =-\sum_{i=1}^{n}\left\{ y_{i}\log\left(\sigma\left(\mathbf{x}_{i}^{T}\beta\right)\right)+\left(1-y_{i}\right)\log\left(1-\sigma\left(\mathbf{x}_{i}^{T}\beta\right)\right)\right\} \end{align*} \]

Thus, we can say that the MLE of \(\beta\), \(\hat{\beta}\), is \[ \hat{\beta} \overset{set}{=} \underset{\beta}{\arg\min} \ \ell\left(\beta\right) \] However, the minimizer of \(\ell\) does not have a closed form, so we need a numerical method. In this article, we will use the optim function in R to find it.

Define functions

Let us start by defining the sigma function, \(\sigma(\cdot)\), and the log-likelihood function \(\ell(\cdot)\) as follows:

# Inverse logit (sigmoid) function
sigma_fcn <- function(x){
  1 / (1 + exp(-x))
}

# Negative log-likelihood of the logistic regression model
ll_fcn <- function(beta, dataX, dataY){
  pi_vec <- sigma_fcn(dataX %*% beta)
  -sum(dataY * log(pi_vec) + (1 - dataY) * log(1 - pi_vec))
}

The model.matrix function is useful because it adds the intercept column to the design matrix automatically.

matX <- model.matrix(admission ~ studytime + nquestion, data = train)
y <- train$admission
head(matX)
  (Intercept) studytime nquestion
1           1 10.584725  5.413679
2           1  6.595139  1.095518
3           1  1.221887  3.824581
4           1  7.196365  4.029620
5           1  5.211391  6.194372
6           1 10.259252  8.519582
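Before running the optimizer, we can sanity-check ll_fcn by evaluating it at the glm estimates; a quick check, since this value should match the negative of the log-likelihood reported by base R's logLik:

# ll_fcn at glm's estimates should equal -logLik of the fitted model
ll_fcn(model$coefficients, matX, y)
-as.numeric(logLik(model))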

By using optim, we can find virtually the same coefficients that we obtained above.

result <- optim(par = c(1, 1, 1), fn = ll_fcn, dataX = matX, dataY = y) 

# convergence check: 0 means success.
result$convergence
[1] 0
# The best set of parameters found.
result$par
[1] -13.8216738   1.2789949   0.9600578
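As a side note, optim's default Nelder-Mead method does not use gradient information. Since the gradient of \(\ell\) has the closed form \(\nabla\ell(\beta)=\mathbf{X}^{T}(\pi-\underline{y})\), where \(\pi=(\pi_1,\dots,\pi_n)^T\), we could also supply it and use a gradient-based method; a minimal sketch (the name grad_fcn and the choice of "BFGS" are illustrative, not part of the original fit):

# Analytic gradient of the negative log-likelihood: X^T (pi - y)
grad_fcn <- function(beta, dataX, dataY){
  pi_vec <- sigma_fcn(dataX %*% beta)
  as.vector(t(dataX) %*% (pi_vec - dataY))
}

result_bfgs <- optim(par = c(1, 1, 1), fn = ll_fcn, gr = grad_fcn,
                     dataX = matX, dataY = y, method = "BFGS")
result_bfgs$par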

Predictions using train data

After we fit the model, the prediction of the probability of getting admission from the U of Iowa can be made using the predict function. However, in this section, we will still use the train data to fine-tune the threshold \(p\).

predicted_prob <- predict(model, train, type = "response")

If we use \(p = 0.5\) as a threshold for the decision, then we can build the following prediction evaluation table; this table is called a confusion matrix in the machine learning field.

prediction <- as.integer(predicted_prob > 0.5)
confusion_mat <- addmargins(table(train$admission, prediction))
# Labeling
names(dimnames(confusion_mat)) <- c("True status", "Prediction")
colnames(confusion_mat) <- c("Fail", "Success", "Total")
rownames(confusion_mat) <- c("Fail", "Success", "Total")
confusion_mat

Confusion matrix

Table 2 shows the confusion matrix corresponding to the predictions with threshold probability 0.5. The interpretation of the table is straightforward; if we use the probability 0.5 as the prediction threshold, there are

Table 2: Confusion Matrix with p = 0.5

              Prediction
True status   Fail  Success  Total
  Fail         167       14    181
  Success       13      181    194
  Total        180      195    375
  • 167 True Negatives: prediction as a fail was correct.

  • 14 False Positives: prediction as a success was wrong.

  • 13 False Negatives: prediction as a fail was wrong.

  • 181 True Positives: prediction as a success was correct.

Some readers might be confused about which term refers to which. Note that the first part of the term (True/False) indicates whether the prediction was correct, and the second part (Positive/Negative) indicates the predicted label. Thus, True Positive means the prediction was Positive (non-null status) and the prediction was correct.

Using the confusion matrix, we can evaluate the classification. The most important measures are as follows (an R sketch for computing them follows the list):

  • False Positive rate: 1 - Specificity (also called the Type I error rate in statistics)

\[ \frac{FP}{TN + FP} = \frac{14}{167 + 14} = 0.0773 \]

  • True Positive rate: Sensitivity (also called Power in statistics)

\[ \frac{TP}{FN + TP} = \frac{181}{(13 + 181)} = 0.933 \]

  • Accuracy:

\[ \frac{TP + TN}{N} = \frac{167 + 181}{375} = 0.928 \]
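These measures can be computed directly from the table above; a short sketch, assuming rows are the true status and columns the prediction, as in the table call earlier (the names conf, TN, FP, FN, TP are mine):

# Extract cells from table(train$admission, prediction):
# rows = true status (0, 1), columns = prediction (0, 1)
conf <- table(train$admission, prediction)
TN <- conf[1, 1]; FP <- conf[1, 2]
FN <- conf[2, 1]; TP <- conf[2, 2]
c(FPR = FP / (TN + FP),
  TPR = TP / (FN + TP),
  Accuracy = (TP + TN) / sum(conf))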

Thus, for each value of \(p\), we obtain the performance measures above; the pairs of FP and TP rates can be computed for every \(p \in (0, 1)\). Figure 2 shows the ROC curve, which is the plot of these pairs of FP and TP rates. The name ROC stands for “receiver operating characteristic,” a term that originated in communications theory.

R code for Figure 2

library(ROCR)
prediction(predicted_prob, train$admission) %>%
  performance(measure = "tpr", x.measure = "fpr") -> result

plotdata <- data.frame(x = result@x.values[[1]],
                       y = result@y.values[[1]], 
                       p = result@alpha.values[[1]])

p <- ggplot(data = plotdata) +
  geom_path(aes(x = x, y = y)) + 
  xlab(result@x.name) +
  ylab(result@y.name) +
  theme_bw()

# Squared distance from each ROC point to the perfect classifier (0, 1)
dist_vec <- plotdata$x^2 + (1 - plotdata$y)^2
opt_pos <- which.min(dist_vec)

# Save the annotated plot as p2; it is reused in the Figure 3 code below
p2 <- p + 
  geom_point(data = plotdata[opt_pos, ], 
             aes(x = x, y = y), col = "red") +
  annotate("text", 
           x = plotdata[opt_pos, ]$x + 0.1,
           y = plotdata[opt_pos, ]$y,
           label = paste("p =", round(plotdata[opt_pos, ]$p, 3)))
p2

Figure 2: The ROC curve for the fitted logistic regression

Which \(p\) should we use?

Selecting the best \(p\) is somewhat subjective since it depends on the prediction context.

One typical method of choosing \(p\) is to find the point on the ROC curve that minimizes the distance to the coordinate \((0, 1)\), which represents a perfect classifier. In other words, we find the \(p\) whose decision rule is as close to a perfect classifier as possible. In Figure 2, the red dot represents this optimal point, with threshold \(p = 0.468\).

However, let us assume that we are dealing with data where success indicates having a specific disease such as COVID-19. Then, as a model evaluator, we want to avoid False Negative predictions, since they have a greater impact on society than False Positive predictions.

Imagine that the tester says a person does not have COVID-19, but the person actually has it.

The same philosophy can be applied to our data set: the situation where a student actually gets admission, yet our prediction says Fail, is the error we most want to avoid.

Note that the power and the False Positive rate have a trade-off relationship; if we declare that every tested person has COVID-19, then the False Negatives will be zero, which means a power of 1. However, this is a useless classifier, since its False Positive rate is also 1. Thus, we need a criterion or rationale for choosing the best \(p\). One possible method is to fix an acceptable level of the False Negative rate in advance and then find the \(p\) value that minimizes the False Positive rate (Type I error).

Table 3: Confusion Matrix with p = 0.468

              Prediction
True status   Fail  Success  Total
  Fail         166       15    181
  Success       12      182    194
  Total        178      197    375

Let us set our acceptable FN rate at 0.02 and try to find the \(p\) value that gives the lowest FP rate on the training data set. The horizontal line in Figure 3 shows the FN rate level of 0.02, or equivalently, a test power of 98%. The green points above the dashed line represent the possible values of \(p\) satisfying our FN rate criterion. The red dot in the figure indicates the FP and TP rates corresponding to the probability value \(p = 0.219\), which has the smallest FP rate (Type I error) among the points satisfying the FN rate criterion. Thus, we can say that \(p = 0.219\) is the best \(p\) among these options.

R code for Figure 3

# First ROC point whose TP rate reaches 0.98, i.e., FN rate <= 0.02;
# along the ROC path, this is the one with the smallest FP rate among them
min_pos <- which.max(plotdata$y >= 0.98)

p2 + 
  geom_point(data = plotdata[plotdata$y >= 0.98, ], 
             aes(x = x, y = y), size = 0.2, col = "green") + 
  geom_hline(yintercept = 0.98, linetype = "dashed") +
  geom_point(data = plotdata[min_pos, ], aes(x = x, y = y), col = "red") +
  annotate("text", 
           x = plotdata[min_pos, ]$x + 0.07,
           y = plotdata[min_pos, ]$y - 0.02,
           label = paste("p =", round(plotdata[min_pos, ]$p, 3)),
           col = "red") + 
  geom_point(data = plotdata[opt_pos, ], 
             aes(x = x, y = y), col = "black") +
  annotate("text", 
           x = plotdata[opt_pos, ]$x + 0.05,
           y = plotdata[opt_pos, ]$y - 0.05,
           label = paste("p =", round(plotdata[opt_pos, ]$p, 3)))

Figure 3: Finding the smallest FP rate under a 2% FN rate

Decision Boundary on the test data

Since we have found our \(p\) value, we can visualize it as a decision boundary. Figure 4 shows the logit function, and the red dashed line marks the threshold \(p\) value that we found in the previous section. Note that \(g(0.219) = \log(0.219/0.781) \approx -1.27\), or equivalently, \(\sigma(-1.27) \approx 0.219\). Thus, using this \(p\) as a decision threshold means that if the value of the linear predictor \(X\beta\) is larger than \(-1.27\), we label the observation as \(1\) (Success) in our prediction.


Figure 4: Logit function

Note that our model has two variables: study time and number of questions. Using the decision rule above, we can draw the following boundary in our data set:

\[ \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 = -1.27 \]
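We can double-check the right-hand side with qlogis, the logit function in base R; a quick sketch:

# logit(p) = log(p / (1 - p))
qlogis(0.219)   # approx -1.27, the threshold on the linear predictor scale
qlogis(0.468)   # approx -0.13, used for the dashed boundary in Figure 5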

Estimated \(\beta\)

model$coefficients
(Intercept)   studytime   nquestion 
-13.8252435   1.2792152   0.9604621 

Evaluation

Now, we are going to use the test data to evaluate the model.

R code for Figure 5

# Adds the line a*x1 + b*x2 + c = 0 to a ggplot, in slope-intercept form
addline <- function(a, b, c, ...){
  my_slope <- -a / b
  my_intercept <- -c / b
  geom_abline(slope = my_slope,
              intercept = my_intercept, ...)
}

p_data <- ggplot(data = test) +
  geom_point(mapping =
               aes(x = studytime,
                   y = nquestion,
               color = factor(admission))) +
  xlab("Study time") +
  ylab("Number of Question") +
  scale_colour_discrete(guide = guide_legend(reverse=TRUE),
                        name="Exam",
                        labels=c("Fail", "Success")) + 
  theme_bw()

p_data + 
  addline(model$coefficients[2],
                 model$coefficients[3],
                 model$coefficients[1] + 1.27, linetype = "solid", col = "blue") +
  addline(model$coefficients[2],
                 model$coefficients[3],
                 model$coefficients[1] + 0.13, linetype = "dashed")

Figure 5: Decision boundary with \(p=0.219\) (solid) and \(p=0.468\) (dashed)

In Figure 5, the solid blue line indicates the decision boundary induced by the \(p = 0.219\) threshold. Compared to the black dashed line, the decision boundary corresponding to \(p = 0.468\), the solid blue line has fewer green dots on its left-hand side. Fewer green dots on the left-hand side means fewer False Negatives.

Confusion matrix on test data

Table 4: Confusion Matrix with p = 0.468

              Prediction
True status   Fail  Success  Total
  Fail          64        5     69
  Success        6       50     56
  Total         70       55    125

Table 5: Confusion Matrix with p = 0.219

              Prediction
True status   Fail  Success  Total
  Fail          55       14     69
  Success        2       54     56
  Total         57       68    125
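These two tables can be reproduced with the same steps we used on the train set; a minimal sketch (the object name test_prob is illustrative):

# Predicted probabilities on the test set
test_prob <- predict(model, test, type = "response")
# Confusion matrices for the two thresholds (rows = truth, cols = prediction)
addmargins(table(test$admission, as.integer(test_prob > 0.468)))
addmargins(table(test$admission, as.integer(test_prob > 0.219)))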

Table 4 and Table 5 show the confusion matrices on the test data set with thresholds \(p = 0.468\) and \(p = 0.219\), respectively. By the definition of the FP rate and the TP rate, we have

  • False Positive rate:

    with p = 0.468 \[ \frac{FP}{TN + FP} = \frac{5}{64 + 5} = 0.072 \] with p = 0.219 \[ \frac{FP}{TN + FP} = \frac{14}{55 + 14} = 0.203 \]

  • True Positive rate:

    with p = 0.468 \[ \frac{TP}{FN + TP} = \frac{50}{(6 + 50)} = 0.893 \] with p = 0.219 \[ \frac{TP}{FN + TP} = \frac{54}{(2 + 54)} = 0.964 \]
  • Accuracy:

    with p = 0.468 \[ \frac{TP + TN}{N} = \frac{64 + 50}{125} = 0.912 \] with p = 0.219 \[ \frac{TP + TN}{N} = \frac{55 + 54}{125} = 0.872 \]

As we can see, the prediction with a threshold of 0.468 has a higher Accuracy than the prediction with a threshold of 0.219. However, when it comes to the TP rate, the decision with a threshold of 0.219 has a much higher rate than that with a threshold of 0.468.
