The HPLOGISTIC Procedure

Example 61.4 Partitioning Data


The Sashelp.JunkMail data set comes from a study that classifies whether an email is junk email (coded as 1) or not (coded as 0). The data were collected by Hewlett-Packard Labs and donated by George Forman. The data set, which is specified in the following program, contains 4,601 observations, with 2 binary variables and 57 continuous explanatory variables. The response variable, Class, is a binary indicator of whether an email is considered spam or not. The partitioning variable, Test, is a binary indicator that is used to divide the data into training and testing sets. The 57 explanatory variables are continuous variables that represent frequencies of some common words and characters and lengths of uninterrupted sequences of capital letters in emails.

In the following program, the PARTITION statement divides the data into two parts. The training data have a Test value of 0 and contain about two-thirds of the data; the rest of the data are used to evaluate the fit. A forward selection method selects the best model based on the training observations.

proc hplogistic data=Sashelp.JunkMail;
   model Class(event='1')=Make Address All _3d Our Over Remove Internet Order
         Mail Receive Will People Report Addresses Free Business Email You
         Credit Your Font _000 Money HP HPL George _650 Lab Labs Telnet _857
         Data _415 _85 Technology _1999 Parts PM Direct CS Meeting Original
         Project RE Edu Table Conference Semicolon Paren Bracket Exclamation
         Dollar Pound CapAvg CapLong CapTotal;
   partition rolevar=Test(train='0' test='1');
   selection method=forward;
run;

Selected results from the analysis are shown in Output 61.4.1 through Output 61.4.3.

The "Number of Observations" and "Response Profile" tables in Output 61.4.1 are divided into training and testing columns.

Output 61.4.1: Partitioned Counts

The HPLOGISTIC Procedure

Number of Observations
Description Total Training Testing
Number of Observations Read 4601 3065 1536
Number of Observations Used 4601 3065 1536

Response Profile
Ordered Value   0 - Not Junk, 1 - Junk   Total Frequency   Training   Testing
1               0                        2788              1847       941
2               1                        1813              1218       595

You are modeling the probability that Class='1'.
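
The partitioned counts in Output 61.4.1 can be cross-checked outside PROC HPLOGISTIC. The following is a minimal sketch that uses PROC FREQ to cross-tabulate the partitioning variable against the response. It assumes only the variables Test and Class described earlier, and it is not part of the original example.

```sas
/* Cross-tabulate the partition indicator against the response.     */
/* Row totals give the training (Test=0) and testing (Test=1) sizes; */
/* cell counts correspond to the "Response Profile" table.          */
proc freq data=Sashelp.JunkMail;
   tables Test*Class / nopercent norow nocol;
run;
```

If the data match this example, the Test=0 row should total 3,065 observations and the Test=1 row should total 1,536.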



The standard likelihood-based fit statistics for the selected model are displayed in the "Fit Statistics" table, with a column for each of the training and testing subsets.

Output 61.4.2: Partitioned Fit Statistics

Fit Statistics
Description Training Testing
-2 Log Likelihood 1202.18 813.03
AIC (smaller is better) 1262.18 873.03
AICC (smaller is better) 1262.80 874.27
BIC (smaller is better) 1443.02 1033.14
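
The gaps between the rows of the "Fit Statistics" table can be checked by hand. The sketch below assumes the usual definitions of the information criteria, where p is the number of estimated parameters and n is the number of observations used in a partition. The value p = 30 is inferred from the table; it is not reported in the output shown here.

```latex
\begin{align*}
\mathrm{AIC}  &= -2\log L + 2p, \\
\mathrm{AICC} &= -2\log L + \frac{2pn}{n-p-1}, \\
\mathrm{BIC}  &= -2\log L + p\log n .
\end{align*}
% Training column: AIC - (-2 log L) = 1262.18 - 1202.18 = 60 = 2p, so p = 30.
% With n = 3065 training observations:
\begin{align*}
\mathrm{AICC} &\approx 1202.18 + \frac{2(30)(3065)}{3034} \approx 1202.18 + 60.61 = 1262.79, \\
\mathrm{BIC}  &\approx 1202.18 + 30\log 3065 \approx 1202.18 + 240.83 = 1443.01 .
\end{align*}
```

Both results agree with the tabulated 1262.80 and 1443.02 to within rounding of the displayed -2 Log Likelihood. The testing column shows the same difference of 60 between AIC and -2 Log Likelihood, which is consistent with the same 30 parameters being evaluated on the testing observations.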


More fit statistics are displayed in the "Partition Fit Statistics" table shown in Output 61.4.3. These statistics are computed for both the training and testing data and should be very similar between the two groups when the training data are representative of the testing data. The statistics include the likelihood-based R-square statistics, as well as several prediction-based statistics that are described in the sections "Model Fit and Assessment Statistics" and "The Hosmer-Lemeshow Goodness-of-Fit Test." For this model, the values of the statistics seem similar between the two disjoint subsets.

Output 61.4.3: More Partitioned Fit Statistics

Partition Fit Statistics
Statistic Training Testing
Area under the ROCC 0.9769 0.9653
Average Square Error 0.05467 0.06351
Hosmer-Lemeshow Test 3.74E-49 0
Misclassification Error 0.07145 0.07878
R-Square 0.6139 0.5533
Max-rescaled R-Square 0.8305 0.7508
McFadden's R-Square 0.7081 0.6035
Mean Difference 0.7596 0.7393
Somers' D 0.9538 0.9307
True Negative Fraction 0.9556 0.9416
True Positive Fraction 0.8875 0.8891
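
Several rows of this table are related to one another, which gives a quick consistency check. The following DATA step is a minimal sketch, not part of the original example. It assumes that the misclassification error, the true negative fraction, and the true positive fraction are all computed at the same classification cutoff, and that Somers' D relates to the area under the ROC curve through the usual identity D = 2(AUC) - 1. The variable names are illustrative.

```sas
data _null_;
   /* Values copied from the Training columns of Output 61.4.1 and 61.4.3 */
   n0  = 1847;      /* training observations with Class=0 */
   n1  = 1218;      /* training observations with Class=1 */
   tnf = 0.9556;    /* True Negative Fraction             */
   tpf = 0.8875;    /* True Positive Fraction             */
   auc = 0.9769;    /* Area under the ROC curve           */

   /* Misclassified = false positives + false negatives */
   misclass = (n0*(1 - tnf) + n1*(1 - tpf)) / (n0 + n1);

   /* Somers' D implied by the area under the ROC curve */
   somers_d = 2*auc - 1;

   /* Write both values to the SAS log */
   put misclass= 8.5 somers_d= 8.4;
run;
```

The logged values should be close to the tabulated training values of 0.07145 and 0.9538; small differences come from the rounding of the displayed inputs.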


If you want to display the "Partition Fit Statistics" table without partitioning your data set, you must identify all your data as training data. One way to do this is to define the fractions for the other roles to be zero:

proc hplogistic data=Sashelp.JunkMail;
   model Class(event='1')= Our Over Remove Internet Order Will
         Free Business You Your Font _000 Money HP George Parts
         Meeting RE Edu Semicolon Exclamation Dollar CapAvg
         CapLong;
   partition fraction(test=0 validation=0);
run;

Another way is to specify a constant variable as the training role:

data JunkMail;
   set Sashelp.JunkMail;
   Role=0;
run;
proc hplogistic data=JunkMail;
   model Class(event='1')= Our Over Remove Internet Order Will
         Free Business You Your Font _000 Money HP George Parts
         Meeting RE Edu Semicolon Exclamation Dollar CapAvg
         CapLong;
   partition rolevar=Role(train='0');
run;

The resulting "Partition Fit Statistics" table is shown in Output 61.4.4.

Output 61.4.4: All Data Are Training Data

The HPLOGISTIC Procedure

Partition Fit Statistics
Statistic Training
Area under the ROCC 0.9724
Average Square Error 0.05910
Hosmer-Lemeshow Test 854E-220
Misclassification Error 0.07324
R-Square 0.5932
Max-rescaled R-Square 0.8034
McFadden's R-Square 0.6708
Mean Difference 0.7325
Somers' D 0.9448
True Negative Fraction 0.9570
True Positive Fraction 0.8803


Last updated: December 09, 2022