Naive Bayes Model

A self use function for Naive Bayes classifier
ML
Classification
Author

Creo Hsia

Published

October 14, 2024


We build a Naive Bayes classifier that can handle any classification problem, assuming continuous variables follow a normal distribution. This classifier is not limited to the specific problem in the exercises.

1 Function Bayesclasifier

The following function fits a Naive Bayes model to the data, estimating prior probabilities and conditional probabilities for both continuous and categorical features. This function takes four inputs: the data, the response (a character string specifying which column contains the class labels), the prior (if provided), and the Laplace smoothing parameter. The output is a list containing the conditional probabilities for each variable, the prior probabilities, an indicator of which columns are numeric or categorical, and the levels of the class labels. The returned list is of class naiveBayes, which allows for method dispatching specific to this model.

Code
Bayesclasifier <- function(data, response, prior = NULL, alpha = 0) {
  # Step 1: Extract the target variable (response)
  # 'label' contains the target class from the 'response' column of the data
  label <- data[[response]]
  label_counts <- table(label)
  level <- unique(label)# Identify unique classes in the target variable
  # Remove the response column from the data 
  #(since it should not be used as a predictor)
  data[[response]] <- NULL
  # Step 2: Calculate prior probabilities if not provided
  if (is.null(prior)) {
    prior <- prop.table(label_counts)
  }
  # Step 3: Identify numeric and categorical columns
  is_num <- sapply(data, function(x) is.numeric(x))
  # Step 4: Process numeric and categorical columns separately
  tbl <- lapply(data,function(col){
    if (is.numeric(col)){
      # For numeric columns, calculate the mean and variance
      #for each class in the target variable
      means <- tapply(col, label, mean,na.rm = TRUE)
      sd <- tapply(col, label, sd,na.rm = TRUE)
      cbind(means,sd)
    }
    else {
      # For categorical columns, calculate smoothed counts 
      # and conditional probabilities
      # Do not display the count of NA values (this is the default behavior)
      counts <- table(label, col)
      # Apply Laplace smoothing to avoid zero probabilities
      smoothed_probs <- sweep(counts + alpha, 1,
                              label_counts+alpha*ncol(counts), "/")
      # Calculate conditional probabilities
      return(smoothed_probs)
    }
  })
  # Return the model as a list containing: 
  #1) prior probabilities, 2) tables of statistics, 3) and metadata
  result <- list(
    prior = prior, tables = tbl, 
    isnumeric=is_num,
    level=level)
  # Assign a class to the result for method dispatch
  class(result) <- "naiveBayes_clasifier"
  return(result)
} 

2 Function predict.naiveBayes_clasifier

The predict function takes a trained Naive Bayes model and new data as inputs, and returns predicted class labels or class probabilities depending on the specified type. For missing values (NA), the function handles them by assigning neutral probabilities (effectively ignoring them in the likelihood calculation).

Code
predict.naiveBayes_clasifier <- function(model, newdata, type = "class") {
  logprior <- log(model$prior)   
  len <- length(model$prior)
  # Match newdata columns with model attributes
  match_idx <- match(names(model$tables), names(newdata))  
  
  # Step 3: Preallocate log probability matrix for faster computation
  Logprob <- matrix(0, nrow = len, ncol = nrow(newdata),dimnames = list(model$level,NULL))
  # Step 1: Convert categorical variables to factors with matching levels
  for (v in seq_along(match_idx)) {
    if (model$isnumeric[v]){
      nd <- newdata[, match_idx[v]]
      tbl <- model$tables[[v]]
      means <- tbl[, 1]
      sd <- tbl[, 2]
      notna <- !is.na(nd)
      Logprob[,notna] <-Logprob[,notna]+ vapply(nd[notna],function(x) dnorm(x,means,sd,log=TRUE),numeric(len))
    }
    else {
      nd <- factor(newdata[,match_idx[v]],
                             levels = colnames(model$tables[[v]]))
      notna <- !is.na(nd)
      Logprob[,notna] <-Logprob[,notna]+ log(model$tables[[v]][,nd[notna]])
    }
  }
  # Step 5: Add log-priors to the computed log-likelihoods
  Logprob <- sweep(Logprob, 1, logprior, "+")
  # Step 6: Return predictions based on the specified type ('class' or 'raw')
  if (type == "class") {
    pred_classes <- model$level[apply(Logprob,2,which.max)]
    return(pred_classes)
  } else {
    # If type == "raw", return probabilities for all classes
    probs <- exp(Logprob)  # Convert log-probabilities back to normal scale
    # Normalize probabilities so they sum to 1
    probs <- apply(probs,2,function(x)x/sum(x) ) 
    return(t(probs))
  }
}

3 Function confusion_matrix and roc_auc

The first function computes the confusion matrix, which compares the true class labels with the predicted class labels to evaluate the performance of the classification model.

  • Input:

    • true_labels: A vector of true class labels for the dataset.

    • predicted_class: A vector of predicted class labels for the dataset. This is typically the output after selecting the class with the highest predicted probability for each instance.

  • Output:

    • A confusion matrix, where rows represent the actual class labels and columns represent the predicted class labels. 

The second function computes the ROC curve and Area Under the Curve (AUC) for each class in the classification problem.

  • Input: the same as above

  • Output:

    A list containing:

    • roc_data: A data frame with columns FPR (False Positive Rate), TPR (True Positive Rate), and class for each class, representing the ROC curve points for each class.

    • auc: A named vector where each entry is the AUC value for one class, with the class names as the vector names.

While these functions can handle binary classification, they are especially suited for multi-class classification problems. In multi-class classification, the “one-vs-all” (or “one-vs-rest”) approach is applied. This means that for each class, the function treats it as the positive class and considers all other classes as the negative class.

Code
# Function to create a confusion matrix based on true labels and predicted class labels
confusion_matrix <- function(true_labels, predicted_class) {
  class_names <- unique(c(true_labels, predicted_class))
  
  # Create an empty confusion matrix
  conf_matrix <- matrix(0, nrow = length(class_names), ncol = length(class_names),
                        dimnames = list("Actual" = class_names, "Predicted" = class_names))
  
  # Populate the confusion matrix
  for (i in seq_along(true_labels)) {
    true_class <- true_labels[i]
    pred_class <- predicted_class[i]
    conf_matrix[true_class, pred_class] <- conf_matrix[true_class, pred_class] + 1
  }
  
  return(conf_matrix)
}
roc_auc <- function(true_labels, predicted_probs) {
  # Get unique class names
  classes <- colnames(predicted_probs)
  
  # Internal function to compute sensitivity and specificity with vectorized operations
  compute_roc_auc <- function(binary_labels, pos_prob) {
    # Sort the predicted probabilities and corresponding labels once
    order_index <- order(pos_prob, decreasing = TRUE)
    sorted_labels <- binary_labels[order_index]
    
    # Number of positive and negative samples
    M <- sum(sorted_labels == 1)
    N <- sum(sorted_labels == 0)
    
    # Cumulative sum to calculate sensitivity (sen) and specificity (spe)
    sen <- cumsum(sorted_labels == 1)  # True positives as we go down the list
    spe <- cumsum(sorted_labels == 0)  # False positives as we go down the list
    
    # Sensitivity and specificity as ratios
    TPR <- c(0, sen / M)
    FPR <- c(0, spe / N)
    
    # Calculate AUC using the trapezoidal rule (efficiently using vectorized operations)
    auc <- sum((FPR[-1] - FPR[-length(FPR)]) * (TPR[-1] + TPR[-length(TPR)])) / 2
    
    # Return results as a matrix for better performance
    return(list(roc = data.frame(FPR = FPR, TPR = TPR), auc = auc))
  }
  
  # Check if it's a binary classification problem
  if (length(classes) == 2) {
    class_name <- classes[1]
    binary_labels <- ifelse(true_labels == class_name, 1, 0)
    pos_prob <- predicted_probs[, class_name]
    
    # Compute ROC for the binary classification
    roc_auc_result <- compute_roc_auc(binary_labels, pos_prob)
    roc_data <- roc_auc_result$roc
    roc_data$class <- class_name  # Add class column
    auc <- setNames(roc_auc_result$auc, class_name)  # Return AUC as named vector
    
  } else {
    # For multi-class classification, use one-vs-all approach
    roc_data_list <- list()
    auc <- setNames(rep(NA, length(classes)), classes)  # Initialize a named vector for AUC
    
    # Loop through each class and calculate sensitivity and specificity
    for (class_name in classes) {
      binary_labels <- ifelse(true_labels == class_name, 1, 0)
      pos_prob <- predicted_probs[, class_name]
      
      # Compute ROC for the current class
      class_roc_auc_result <- compute_roc_auc(binary_labels, pos_prob)
      class_roc_data <- class_roc_auc_result$roc
      class_roc_data$class <- class_name  # Add class column
      
      # Store ROC data and AUC for the current class
      roc_data_list[[class_name]] <- class_roc_data
      auc[class_name] <- class_roc_auc_result$auc  # Assign AUC to the named vector
    }
    
    # Combine ROC results for all classes into a single data frame
    roc_data <- do.call(rbind, roc_data_list)
  }
  result <- list(roc_data = roc_data, auc = auc)
  class(result) <- "roc_auc"
  return(result)
}

4 Function plot.roc_auc

A ggplot2 plot displaying the ROC curves for all classes. Each class is represented by a different color, and the AUC values are shown on the plot to summarize the classifier’s performance for each class.

Code
plot.roc_auc <- function(x, ...) {
  # Extract ROC data and AUC values
  roc_data <- x$roc_data
  auc_values <- x$auc
  
  # Create a data frame for AUC values to annotate
  auc_annotations <- data.frame(
    class = names(auc_values),
    auc = round(auc_values, 4),
    label = paste(names(auc_values), ": ", round(auc_values, 4), sep = "")
  )
  
  # Plot all ROC curves, using different colors for each class
  p <- ggplot(roc_data, aes(x = FPR, y = TPR, color = class)) +
    geom_line(linewidth = 1) +
    geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "gray") +
    labs(title = "ROC Curves for All Classes",
         x = "False Positive Rate", y = "True Positive Rate") +
    theme_minimal()
  
  # Annotate AUC values at the top-right corner without any box or frame
  auc_text <- paste("AUC:\n", paste(auc_annotations$label, collapse = "\n"))
  
  p <- p + annotate("text", x = 0.8, y = 0.6, hjust = 0,
                    label = auc_text, size = 3, color = "black")
  
  # Display the plot
  return(p)
}

5 A real example

We apply the custom Naive Bayes classifier to the Adult dataset for classification (the data can be found in Github) . We also compare its performance in terms of speed with the naiveBayes function from the e1071 R package.

library(ggplot2)
data <- read.csv("adult.data",
                 na.strings = " ?", header=FALSE)
newdata <- read.table("adult.test",
                      sep=",", skip = 1,na.strings = " ?")
newdata$V15 <- gsub("\\.$", "", newdata$V15)
system.time({
  model <- Bayesclasifier(data,"V15",alpha = 1)
  pred <- predict(model,newdata,"raw")  
  roc <- roc_auc(newdata$V15,pred)
})
#>|    user  system elapsed 
#>|   0.159   0.006   0.165
library(e1071)
system.time({
  model2 <- e1071::naiveBayes(V15~.,data=data)
  pred2 <- predict(model2,newdata)
})
#>|    user  system elapsed 
#>|   1.299   0.006   1.305

Faster than the function in e1071.

confusion_matrix(newdata$V15,pred <- predict(model,newdata,"class"))
#>|         Predicted
#>| Actual    <=50K  >50K
#>|    <=50K  11547   888
#>|    >50K    1863  1983
plot(roc)

6 Function print.naiveBayes_clasifier

This function prints a summary of the fitted model, including the prior probabilities, the conditional probabilities for each variable, and an indication of whether each variable is numeric or categorical.

Code
print.naiveBayes_clasifier <- function(model) {
  cat("Naive Bayes Model\n")
  cat("=================\n")
  
  # Print class levels
  cat("\nClass Levels:\n")
  print(model$level)
  
  # Print prior probabilities
  cat("\nPrior Probabilities:\n")
  print(model$prior)
  
  # Loop through the model tables to display statistics for each feature
  cat("\nFeature Statistics:\n")
  for (i in seq_along(model$tables)) {
    feature_name <- names(model$tables)[i]
    cat("\nFeature:", feature_name, "\n")
    
    if (model$isnumeric[i]) {
      # For numeric features, print mean and standard deviation
      cat("Type: Numeric\n")
      stats <- model$tables[[i]]
      print(data.frame(
        Mean = stats[, 1], SD = stats[, 2]))
    } else {
      # For categorical features, print conditional probabilities
      cat("Type: Categorical\n")
      probs <- model$tables[[i]]
      print(as.data.frame.matrix(probs))
    }
  }
  
  cat("\n=================\n")
}
Code
print(model)
#>| Naive Bayes Model
#>| =================
#>| 
#>| Class Levels:
#>| [1] " <=50K" " >50K" 
#>| 
#>| Prior Probabilities:
#>| label
#>|     <=50K      >50K 
#>| 0.7591904 0.2408096 
#>| 
#>| Feature Statistics:
#>| 
#>| Feature: V1 
#>| Type: Numeric
#>|            Mean       SD
#>|  <=50K 36.78374 14.02009
#>|  >50K  44.24984 10.51903
#>| 
#>| Feature: V2 
#>| Type: Categorical
#>|         Federal-gov  Local-gov  Never-worked   Private  Self-emp-inc
#>|  <=50K   0.02385959 0.05972986  0.0003235199 0.7171627    0.02001779
#>|  >50K    0.04739457 0.07873614  0.0001274048 0.6324373    0.07937317
#>|         Self-emp-not-inc  State-gov  Without-pay
#>|  <=50K        0.07351990 0.03825623 0.0006065998
#>|  >50K         0.09236845 0.04510129 0.0001274048
#>| 
#>| Feature: V3 
#>| Type: Numeric
#>|            Mean       SD
#>|  <=50K 190340.9 106482.3
#>|  >50K  188005.0 102541.8
#>| 
#>| Feature: V4 
#>| Type: Categorical
#>|               10th        11th        12th      1st-4th     5th-6th     7th-8th
#>|  <=50K 0.035252264 0.045116429 0.016211190 0.0065895860 0.012855757 0.024539133
#>|  >50K  0.008018328 0.007763778 0.004327351 0.0008909253 0.002163676 0.005218277
#>|                9th  Assoc-acdm  Assoc-voc  Bachelors   Doctorate   HS-grad
#>|  <=50K 0.019728331  0.03246281 0.04131630  0.1267384 0.004366106 0.3568483
#>|  >50K  0.003563701  0.03385516 0.04607356  0.2828051 0.039073438 0.2133130
#>|           Masters   Preschool  Prof-school  Some-college
#>|  <=50K 0.03092658 0.002102199  0.006225744     0.2387209
#>|  >50K  0.12218404 0.000127275  0.053964618     0.1766578
#>| 
#>| Feature: V5 
#>| Type: Numeric
#>|             Mean       SD
#>|  <=50K  9.595065 2.436147
#>|  >50K  11.611657 2.385129
#>| 
#>| Feature: V6 
#>| Type: Categorical
#>|          Divorced  Married-AF-spouse  Married-civ-spouse  Married-spouse-absent
#>|  <=50K 0.16099810       0.0005661827           0.3350588            0.015570025
#>|  >50K  0.05912334       0.0014016310           0.8528287            0.004459735
#>|         Never-married   Separated    Widowed
#>|  <=50K     0.41222146 0.038823958 0.03676143
#>|  >50K      0.06269113 0.008537207 0.01095821
#>| 
#>| Feature: V7 
#>| Type: Categorical
#>|         Adm-clerical  Armed-Forces  Craft-repair  Exec-managerial
#>|  <=50K    0.13196410  0.0003638716     0.1282041       0.08486294
#>|  >50K     0.06467218  0.0002546149     0.1183959       0.25066836
#>|         Farming-fishing  Handlers-cleaners  Machine-op-inspct  Other-service
#>|  <=50K       0.03557856         0.05195278         0.07087410     0.12771893
#>|  >50K        0.01476766         0.01107575         0.03195417     0.01756843
#>|         Priv-house-serv  Prof-specialty  Protective-serv     Sales
#>|  <=50K     0.0060240964      0.09226166       0.01774885 0.1078677
#>|  >50K      0.0002546149      0.23679185       0.02698918 0.1252705
#>|         Tech-support  Transport-moving
#>|  <=50K    0.02611789        0.05166977
#>|  >50K     0.03615532        0.04086569
#>| 
#>| Feature: V8 
#>| Type: Categorical
#>|          Husband  Not-in-family  Other-relative   Own-child  Unmarried
#>|  <=50K 0.2942651      0.3013023     0.038218879 0.202297177 0.13059128
#>|  >50K  0.7543010      0.1092137     0.004842615 0.008665732 0.02790875
#>|              Wife
#>|  <=50K 0.03332524
#>|  >50K  0.09506818
#>| 
#>| Feature: V9 
#>| Type: Categorical
#>|         Amer-Indian-Eskimo  Asian-Pac-Islander      Black       Other     White
#>|  <=50K         0.011162791          0.03089990 0.11073812 0.009989889 0.8372093
#>|  >50K          0.004715779          0.03530461 0.04945195 0.003313790 0.9072139
#>| 
#>| Feature: V10 
#>| Type: Categorical
#>|           Female      Male
#>|  <=50K 0.3880349 0.6119651
#>|  >50K  0.1504526 0.8495474
#>| 
#>| Feature: V11 
#>| Type: Numeric
#>|             Mean         SD
#>|  <=50K  148.7525   963.1393
#>|  >50K  4006.1425 14570.3790
#>| 
#>| Feature: V12 
#>| Type: Numeric
#>|             Mean       SD
#>|  <=50K  53.14292 310.7558
#>|  >50K  195.00153 595.4876
#>| 
#>| Feature: V13 
#>| Type: Numeric
#>|            Mean       SD
#>|  <=50K 38.84021 12.31899
#>|  >50K  45.47303 11.01297
#>| 
#>| Feature: V14 
#>| Type: Categorical
#>|            Cambodia      Canada       China     Columbia        Cuba
#>|  <=50K 0.0005250192 0.003352046 0.002261621 0.0023423933 0.002867412
#>|  >50K  0.0010149708 0.005074854 0.002664298 0.0003806141 0.003298655
#>|         Dominican-Republic      Ecuador  El-Salvador     England       France
#>|  <=50K        0.0027866403 0.0010096523  0.003957837 0.002463552 0.0007269496
#>|  >50K         0.0003806141 0.0006343568  0.001268714 0.003933012 0.0016493276
#>|            Germany      Greece    Guatemala        Haiti  Holand-Netherlands
#>|  <=50K 0.003796293 0.000888494 0.0025039376 0.0016558297        8.077218e-05
#>|  >50K  0.005709211 0.001141842 0.0005074854 0.0006343568        1.268714e-04
#>|            Honduras         Hong      Hungary       India        Iran
#>|  <=50K 0.0005250192 0.0006057914 0.0004442470 0.002463552 0.001050038
#>|  >50K  0.0002537427 0.0008880995 0.0005074854 0.005201725 0.002410556
#>|             Ireland       Italy     Jamaica       Japan         Laos
#>|  <=50K 0.0008077218 0.001978918 0.002907799 0.001575058 0.0006865635
#>|  >50K  0.0007612281 0.003298655 0.001395585 0.003171784 0.0003806141
#>|             Mexico    Nicaragua  Outlying-US(Guam-USVI-etc)         Peru
#>|  <=50K 0.024675902 0.0013327410                0.0006057914 0.0012115827
#>|  >50K  0.004313626 0.0003806141                0.0001268714 0.0003806141
#>|         Philippines      Poland     Portugal  Puerto-Rico     Scotland
#>|  <=50K  0.005573281 0.001978918 0.0013731271  0.004159767 0.0004038609
#>|  >50K   0.007866024 0.001649328 0.0006343568  0.001649328 0.0005074854
#>|              South      Taiwan     Thailand  Trinadad&Tobago  United-States
#>|  <=50K 0.002625096 0.001292355 0.0006461775     0.0007269496      0.8884940
#>|  >50K  0.002156813 0.002664298 0.0005074854     0.0003806141      0.9099213
#>|             Vietnam   Yugoslavia
#>|  <=50K 0.0025443237 0.0004442470
#>|  >50K  0.0007612281 0.0008880995
#>| 
#>| =================

7 Supervisor-provided R code for Naive Bayes Classifier

The followings are the code provided from my supervisor, which suit for the adult data only.

Code

ptm <- proc.time()

#### read training and tesing
D = read.table("adult.data",sep=",",stringsAsFactors = F)
Dt = read.table("adult.test",sep=",",skip=1,stringsAsFactors = F)

colnames(D)= c("age","workclass","fnlwgt","education","education_num","matrital_status",
               "occupation","relationship","race","sex","capital_gain","capital_loss",
               "hours_per_week","native_country/region","salary")
colnames(Dt) = colnames(D)

#### extract discrete variables
dis.indice = c(2,4,6,7,8,9,10,14)
con.indice = c(1,3,5,11,12,13)
is.class.zero = (D[,15]==" <=50K")

Yt = rep(0,nrow(Dt))
Yt[Dt[,15]==" >50K."] = 1

#### training and testing
alpha = 1
n0 = sum(is.class.zero)
n1 = sum(!is.class.zero)

P0 = matrix(0,nrow=nrow(Dt),ncol=(ncol(Dt)-1))
P1 = P0

##### for discrete variables
for (i in dis.indice){
  x = c(D[is.class.zero,i],D[!is.class.zero,i],Dt[,i])
  x = as.factor(x)
  
  xnames = levels(x)
  loc = which(xnames==" ?")
  if (length(loc)!=0){
    xnames = xnames[-loc]
  }
  d = length(xnames)
  
  #### training and testing at the same time
  for (j in 1:d){
    P0[Dt[,i]==xnames[j],i] = log((sum(D[is.class.zero,i]==xnames[j])+alpha)/(n0+d*alpha))
    P1[Dt[,i]==xnames[j],i] = log((sum(D[!is.class.zero,i]==xnames[j])+alpha)/(n1+d*alpha))
  }
}


##### for continuous variables
for (i in con.indice){
  sigma0 = sd(D[is.class.zero,i]); sigma1 = sd(D[!is.class.zero,i])
  mu0 = mean(D[is.class.zero,i]); mu1 = mean(D[!is.class.zero,i])
  P0[,i] = -log(sigma0)-(Dt[,i]-mu0)^2/sigma0^2/2
  P1[,i] = -log(sigma1)-(Dt[,i]-mu1)^2/sigma1^2/2
}

#### NB result
P0 = cbind(P0,rep(log(n0/(n0+n1)),nrow(P0)))
P1 = cbind(P1,rep(log(n1/(n0+n1)),nrow(P1)))
diff.P = P0 - P1
colnames(diff.P) = c("age","workclass","fnlwgt","education","education_num","matrital_status",
  "occupation","relationship","race","sex","capital_gain","capital_loss",
  "hours_per_week","native_country/region","prior")

diff = rowSums(diff.P)          

#M0 = rowSums(P0)+log(n0/(n0+n1))
#M1 = rowSums(P1)+log(n1/(n0+n1))
#diff = M0-M1

###### plot ROC
Yt.sorted = Yt[sort(diff,index.return=T)[["ix"]]]
sen = rep(0,length(Yt.sorted)+1)
spe = rep(sum(Yt.sorted==0),length(Yt.sorted)+1)
for (j in 1:length(Yt.sorted)){
  if (Yt.sorted[j]==1){
    sen[j+1]=sen[j]+1;spe[j+1]=spe[j]
  }else{
    spe[j+1]=spe[j]-1;sen[j+1]=sen[j]
  }
}
sen = sen/sum(Yt.sorted==1)
spe = spe/sum(Yt.sorted==0)

auc = sum((spe[1:(length(spe)-1)]-spe[2:length(spe)])*sen[1:(length(sen)-1)])

### confusion matrix
Confmat = matrix(NA,nrow = 2,ncol = 2)
rownames(Confmat) = c("<=50K",">50K")
colnames(Confmat) = c("Predict <=50K","Predict >50K")
Confmat["<=50K","Predict <=50K"] = sum(diff>0 & Yt==0)
Confmat["<=50K","Predict >50K"] = sum(diff<0 & Yt==0)
Confmat[">50K","Predict <=50K"] = sum(diff>0 & Yt==1)
Confmat[">50K","Predict >50K"] = sum(diff<0 & Yt==1)
acc = sum(Confmat[c(1,4)])/sum(Confmat)

Result:

Confmat
#>|       Predict <=50K Predict >50K
#>| <=50K         11547          888
#>| >50K           1863         1983
cat("Total accuary =", acc)
#>| Total accuary = 0.83103

proc.time() - ptm
#>|    user  system elapsed 
#>|   0.216   0.008   0.224
plot(1-spe,sen,type="l",col="blue",main = paste0("ROC curve, AUC =",round(auc,3)),ylab = "sensitivity",xlab = "1-specificity")
abline(a=0,b=1)