The Sashelp.JunkMail data set comes from a study that classifies whether an email is junk email (coded as 1) or not (coded as 0). The data were collected by Hewlett-Packard Labs and donated by George Forman. The data set, which is analyzed in the following program, contains 4,601 observations with 2 binary variables and 57 continuous explanatory variables. The response variable, Class, is a binary indicator of whether an email is considered spam. The partitioning variable, Test, is a binary indicator that is used to divide the data into training and testing sets. The 57 explanatory variables are continuous and represent the frequencies of some common words and characters, as well as the lengths of uninterrupted sequences of capital letters, in the emails.
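If you want to confirm the structure of the data and the partition coding before fitting the model, an optional check (not part of this example's output) with the Base SAS procedures CONTENTS and FREQ might look like the following; the Class-by-Test counts should agree with the partitioned counts reported later in Output 61.4.1.

/* Optional: inspect the variables and the partition coding */
proc contents data=Sashelp.JunkMail varnum;
run;

/* Cross-tabulate the response against the partitioning variable */
proc freq data=Sashelp.JunkMail;
   tables Class*Test / norow nocol nopercent;
run;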
In the following program, the PARTITION statement divides the data into two parts. The training data have a Test value of 0 and contain about two-thirds of the data; the rest of the data are used to evaluate the fit. A forward selection method selects the best model based on the training observations.
proc hplogistic data=Sashelp.JunkMail;
   model Class(event='1')=Make Address All _3d Our Over Remove Internet Order
                          Mail Receive Will People Report Addresses Free Business Email You
                          Credit Your Font _000 Money HP HPL George _650 Lab Labs Telnet _857
                          Data _415 _85 Technology _1999 Parts PM Direct CS Meeting Original
                          Project RE Edu Table Conference Semicolon Paren Bracket Exclamation
                          Dollar Pound CapAvg CapLong CapTotal;
   partition rolevar=Test(train='0' test='1');
   selection method=forward;
run;
Selected results from the analysis are shown in Output 61.4.1 through Output 61.4.3.
The "Number of Observations" and "Response Profile" tables in Output 61.4.1 are divided into training and testing columns.
Output 61.4.1: Partitioned Counts
Number of Observations

| Description | Total | Training | Testing |
|---|---|---|---|
| Number of Observations Read | 4601 | 3065 | 1536 |
| Number of Observations Used | 4601 | 3065 | 1536 |
Response Profile

| Ordered Value | Class (0 = Not Junk, 1 = Junk) | Total Frequency | Training | Testing |
|---|---|---|---|---|
| 1 | 0 | 2788 | 1847 | 941 |
| 2 | 1 | 1813 | 1218 | 595 |

You are modeling the probability that Class='1'.
The standard likelihood-based fit statistics for the selected model are displayed in the "Fit Statistics" table, with a column for each of the training and testing subsets.
Output 61.4.2: Partitioned Fit Statistics
Fit Statistics

| Description | Training | Testing |
|---|---|---|
| -2 Log Likelihood | 1202.18 | 813.03 |
| AIC (smaller is better) | 1262.18 | 873.03 |
| AICC (smaller is better) | 1262.80 | 874.27 |
| BIC (smaller is better) | 1443.02 | 1033.14 |
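As a quick way to read this table, assume the standard definitions of the information criteria, where $-2\log L$ is the value in the first row, $p$ is the number of model parameters, $n$ is the number of observations in the subset, and $\log$ denotes the natural logarithm:

$$
\mathrm{AIC} = -2\log L + 2p, \qquad
\mathrm{AICC} = -2\log L + \frac{2pn}{n-p-1}, \qquad
\mathrm{BIC} = -2\log L + p\,\log n .
$$

For the training column, $1262.18 - 1202.18 = 2p$ gives $p = 30$ parameters, and $1202.18 + 30\log(3065) \approx 1443.02$ agrees with the BIC entry; the testing column is consistent with the same $p = 30$ and $n = 1536$.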
More fit statistics are displayed in the "Partition Fit Statistics" table shown in Output 61.4.3. These statistics are computed for both the training and testing data and should be very similar between the two groups when the training data are representative of the testing data. The statistics include the likelihood-based R-square statistics, as well as several prediction-based statistics that are described in the sections "Model Fit and Assessment Statistics" and "The Hosmer-Lemeshow Goodness-of-Fit Test." For this model, the values of the statistics seem similar between the two disjoint subsets.
Output 61.4.3: More Partitioned Fit Statistics
Partition Fit Statistics

| Statistic | Training | Testing |
|---|---|---|
| Area under the ROCC | 0.9769 | 0.9653 |
| Average Square Error | 0.05467 | 0.06351 |
| Hosmer-Lemeshow Test | 3.74E-49 | 0 |
| Misclassification Error | 0.07145 | 0.07878 |
| R-Square | 0.6139 | 0.5533 |
| Max-rescaled R-Square | 0.8305 | 0.7508 |
| McFadden's R-Square | 0.7081 | 0.6035 |
| Mean Difference | 0.7596 | 0.7393 |
| Somers' D | 0.9538 | 0.9307 |
| True Negative Fraction | 0.9556 | 0.9416 |
| True Positive Fraction | 0.8875 | 0.8891 |
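If you want to reproduce statistics such as the average square error or the misclassification error yourself, one possible approach is sketched below. It is not part of the original example: the data set and variable names Pred, Check, Phat, SqErr, and MisClass are arbitrary, and a 0.5 probability cutoff is assumed for classification. The idea is to save the predicted probabilities from the selected model with an OUTPUT statement and then summarize them by the value of Test.

/* A sketch: save predicted probabilities and recompute two of the
   partition fit statistics by hand. */
proc hplogistic data=Sashelp.JunkMail;
   model Class(event='1')=Make Address All _3d Our Over Remove Internet Order
                          Mail Receive Will People Report Addresses Free Business Email You
                          Credit Your Font _000 Money HP HPL George _650 Lab Labs Telnet _857
                          Data _415 _85 Technology _1999 Parts PM Direct CS Meeting Original
                          Project RE Edu Table Conference Semicolon Paren Bracket Exclamation
                          Dollar Pound CapAvg CapLong CapTotal;
   partition rolevar=Test(train='0' test='1');
   selection method=forward;
   id Class Test;                        /* carry these into the output data set         */
   output out=Pred pred=Phat;            /* Phat = predicted probability that Class='1'  */
run;

data Check;
   set Pred;
   SqErr    = (Class - Phat)**2;         /* squared error for the average square error   */
   MisClass = ((Phat >= 0.5) ne Class);  /* assumes a 0.5 classification cutoff          */
run;

/* Means by Test: Test=0 is the training subset, Test=1 the testing subset */
proc means data=Check mean maxdec=5;
   class Test;
   var SqErr MisClass;
run;

The means of SqErr and MisClass for each value of Test should be close to the Average Square Error and Misclassification Error rows in Output 61.4.3.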
If you want to display the "Partition Fit Statistics" table without partitioning your data set, you must identify all your data as training data. One way to do this is to define the fractions for the other roles to be zero:
proc hplogistic data=Sashelp.JunkMail;
   model Class(event='1')= Our Over Remove Internet Order Will
                           Free Business You Your Font _000 Money HP George Parts
                           Meeting RE Edu Semicolon Exclamation Dollar CapAvg
                           CapLong;
   partition fraction(test=0 validate=0);
run;
Another way is to specify a constant variable as the training role:
data JunkMail;
   set Sashelp.JunkMail;
   Role=0;
run;
proc hplogistic data=JunkMail;
   model Class(event='1')= Our Over Remove Internet Order Will
                           Free Business You Your Font _000 Money HP George Parts
                           Meeting RE Edu Semicolon Exclamation Dollar CapAvg
                           CapLong;
   partition rolevar=Role(train='0');
run;
The resulting "Partition Fit Statistics" table is shown in Output 61.4.4.
Output 61.4.4: All Data Are Training Data
Partition Fit Statistics

| Statistic | Training |
|---|---|
| Area under the ROCC | 0.9724 |
| Average Square Error | 0.05910 |
| Hosmer-Lemeshow Test | 854E-220 |
| Misclassification Error | 0.07324 |
| R-Square | 0.5932 |
| Max-rescaled R-Square | 0.8034 |
| McFadden's R-Square | 0.6708 |
| Mean Difference | 0.7325 |
| Somers' D | 0.9448 |
| True Negative Fraction | 0.9570 |
| True Positive Fraction | 0.8803 |