Machine Learning – Random Forests

This page shows the progress of my research project for the Machine Learning for Data Analysis course by Wesleyan University on Coursera (Week 2 Assignment: Running a Random Forest).

Summary of the Random Forest Model

Random forest analysis was performed to evaluate the importance of a series of explanatory variables in predicting a binary, categorical response variable, Alcohol Use Disorder (AlcDisorder), with the U.S. National Epidemiological Survey on Alcohol and Related Conditions (NESARC) 2001–2002 wave 1 data (n = 43,093).

The following explanatory variables were included as possible contributors to a random forest evaluating Alcohol Abuse/Dependence (i.e., Alcohol Use Disorder): age, gender, race/ethnicity, household income, marital status, education level, occupation and other job-related conditions, relationship problems, financial hardship, region and type of housing, age at first drinking alcohol, drinking pattern and frequency, smoking, drug use, gambling, mental health problems, and family history of addiction or mental health issues. [*The complete list is shown below]

The random forest model was built with the following settings: Variables to Try=8, Maximum Trees=100, Maximum Depth=20, Inbag Fraction=0.6, Split Criterion=Gini, and Prune Fraction=0 (no pruning). The accuracy of the model was 92.3% (Misclassification Rate=0.077), which is adequate for the purpose of the current work, where the forest serves as a variable reduction technique.
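To make the Split Criterion=Gini setting and the accuracy figure concrete, here is a minimal Python sketch. It is not the SAS code used for this analysis. The function names and the toy labels are illustrative assumptions; the only numbers taken from this page are the misclassification rate (0.077) and the accuracy (92.3%).

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity of a node: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def split_gain(parent, left, right):
    """Reduction in Gini impurity when `parent` is split into `left` and `right`."""
    n = len(parent)
    weighted_child_impurity = (
        (len(left) / n) * gini_impurity(left)
        + (len(right) / n) * gini_impurity(right)
    )
    return gini_impurity(parent) - weighted_child_impurity


def misclassification_rate(y_true, y_pred):
    """Fraction of observations whose predicted class differs from the true class."""
    wrong = sum(1 for t, p in zip(y_true, y_pred) if t != p)
    return wrong / len(y_true)


# Toy node (invented labels, 1 = disorder, 0 = no disorder), not NESARC data.
parent = [1, 1, 1, 0, 0, 0, 0, 0]
left, right = [1, 1, 1, 0], [0, 0, 0, 0]
gain = split_gain(parent, left, right)

# Accuracy and misclassification rate are complements of each other.
reported_misclassification = 0.077  # value reported on this page
accuracy = 1.0 - reported_misclassification
print(f"Gini gain of the toy split: {gain:.5f}")
print(f"Accuracy implied by the reported rate: {accuracy:.1%}")
```

A tree grown with the Gini criterion chooses, at each node, the candidate split with the largest impurity reduction of this kind. In a random forest only a random subset of the variables (here 8) is considered at each split.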

The explanatory variables with the top 10 highest relative importance scores were as follows:

  1. S2AQ10 (HOW OFTEN DRANK ENOUGH TO FEEL INTOXICATED IN LAST 12 MONTHS)
  2. S2AQ12B (HOW OFTEN DRANK AFTER MIDNIGHT IN LAST 12 MONTHS)
  3. S2AQ5B (HOW OFTEN DRANK BEER IN LAST 12 MONTHS)
  4. DGSTATUS (DRUG USE STATUS)
  5. S2AQ12A (HOW OFTEN DRANK BEFORE 3 PM IN LAST 12 MONTHS)
  6. S2AQ12D (HOW OFTEN DRANK IN PUBLIC PLACES IN LAST 12 MONTHS)
  7. S2AQ7B (HOW OFTEN DRANK LIQUOR IN LAST 12 MONTHS)
  8. S3BQ1A6 (EVER USED COCAINE OR CRACK)
  9. AgeAlcUse (AGE RANGE WHEN STARTED DRINKING)
  10. TAB12MDX (NICOTINE DEPENDENCE IN THE LAST 12 MONTHS)
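Using a forest for variable reduction comes down to sorting the variables by importance score and keeping the top few. The Python sketch below illustrates that step only. The variable names and scores are invented placeholders, not the values from the SAS output, and `top_k_variables` is a hypothetical helper.

```python
def top_k_variables(importance, k=10):
    """Return the names of the k variables with the highest importance scores."""
    ranked = sorted(importance.items(), key=lambda item: item[1], reverse=True)
    return [name for name, _score in ranked[:k]]


# Invented placeholder scores for illustration only (not results from this study).
toy_scores = {
    "var_a": 0.21,
    "var_b": 0.05,
    "var_c": 0.34,
    "var_d": 0.12,
}

selected = top_k_variables(toy_scores, k=2)
print(selected)
```

The selected names can then be carried forward as the reduced set of candidate predictors for a follow-up model.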

It is interesting to see that the importance score of the age when the respondent started drinking (AgeAlcUse) is comparable to those for past cocaine or crack use (S3BQ1A6) and nicotine dependence (TAB12MDX). This finding is consistent with the hypothesis that the age at which a person starts drinking is a potentially powerful predictor of progression to alcohol-related harm such as alcohol abuse and dependence. As many researchers have suggested, results on this topic can have important policy implications for the development of alcohol-abuse prevention programs for youth (see the summary of the literature review for the references).

 

* The full list of the 59 explanatory variables used in the model is as follows:

SEX="GENDER"; AGE="AGE"; AgeAlcUse="AGE RANGE WHEN STARTED DRINKING";
CENDIV="CENSUS DIVISION"; BUILDTYP="TYPE OF BUILDING FOR HOUSEHOLD";
S1Q1D5="WHITE CHECKED IN MULTIRACE CODE"; S1Q1C="HISPANIC OR LATINO"; S1Q1D1="AMERICAN INDIAN OR ALASKA NATIVE"; S1Q1D2="ASIAN"; S1Q1D3="BLACK OR AFRICAN AMERICAN"; S1Q1D4="NATIVE HAWAIIAN OR OTHER PACIFIC ISLANDER";
S1Q2A="LIVED WITH AT LEAST 1 BIOLOGICAL PARENT BEFORE AGE 18"; S1Q2B="BIOLOGICAL FATHER EVER LIVE IN HOUSEHOLD BEFORE RESPONDENT WAS 18";
SPOUSE="SPOUSE OF RESPONDENT IN HOUSEHOLD";
S1Q4A="AGE AT FIRST MARRIAGE"; S1Q4B="HOW FIRST MARRIAGE ENDED";
S1Q6A="HIGHEST GRADE OR YEAR OF SCHOOL COMPLETED";
S1Q7A1="WORKING FULL TIME (35+ HOURS A WEEK)"; S1Q7A2="WORKING PART TIME (<35 HOURS A WEEK)"; S1Q7A3="EMPLOYED BUT NOT AT WORK BECAUSE OF TEMPORARY ILLNESS OR INJURY";
S1Q9A="BUSINESS OR INDUSTRY"; S1Q9B="OCCUPATION"; S1Q9C="TYPE OF EMPLOYER";
S1Q12B="TOTAL HOUSEHOLD INCOME IN LAST 12 MONTHS"; S1Q14A="PERSONALLY RECEIVED FOOD STAMPS IN LAST 12 MONTHS";
S1Q16="SELF-PERCEIVED CURRENT HEALTH";
S1Q232="ANY FAMILY MEMBERS OR CLOSE FRIENDS HAD SERIOUS ILLNESS OR INJURY IN LAST 12 MONTHS"; S1Q233="MOVED/ANYONE NEW CAME TO LIVE WITH YOU IN LAST 12 MONTHS";
S1Q234="FIRED OR LAID OFF FROM JOB IN LAST 12 MONTHS"; S1Q236="HAD TROUBLE WITH BOSS OR COWORKER IN LAST 12 MONTHS"; S1Q237="CHANGED JOBS, JOB RESPONSIBILITIES OR WORK HOURS IN LAST 12 MONTHS";
S1Q238="GOT SEPARATED OR DIVORCED OR BROKE OFF STEADY RELATIONSHIP IN LAST 12 MONTHS"; S1Q239="HAD PROBLEMS WITH NEIGHBOR, FRIEND OR RELATIVE IN LAST 12 MONTHS";
S1Q2310="EXPERIENCED MAJOR FINANCIAL CRISIS OR BANKRUPTCY IN LAST 12 MONTHS";
S1Q2312="YOU OR FAMILY MEMBER BEEN VICTIM OF CRIME IN LAST 12 MONTHS";
S2DQ1="BLOOD/NATURAL FATHER EVER AN ALCOHOLIC OR PROBLEM DRINKER"; S2DQ2="BLOOD/NATURAL MOTHER EVER AN ALCOHOLIC OR PROBLEM DRINKER"; S2DQ3C2="ANY FULL BROTHERS EVER ALCOHOLICS OR PROBLEM DRINKERS"; S2DQ4C2="ANY FULL SISTERS EVER ALCOHOLICS OR PROBLEM DRINKERS";
SMOKER="TOBACCO USE STATUS"; S3AQ3C1="USUAL QUANTITY WHEN SMOKED CIGARETTES"; TAB12MDX="NICOTINE DEPENDENCE IN THE LAST 12 MONTHS"; TABP12MDX="NICOTINE DEPENDENCE PRIOR TO THE LAST 12 MONTHS";
S3BQ1A6="EVER USED COCAINE OR CRACK"; DGSTATUS="DRUG USE STATUS";
DGENAXDXSNI12="GENERALIZED ANXIETY IN LAST 12 MONTHS"; DGENAXDXSNIP12="GENERALIZED ANXIETY PRIOR TO THE LAST 12 MONTHS";
GAMB12DX="PATHOLOGICAL GAMBLING IN LAST 12 MONTHS"; GAMBP12DX="PATHOLOGICAL GAMBLING PRIOR TO THE LAST 12 MONTHS";
ANTISOCDX2="ANTISOCIAL PERSONALITY DISORDER (WITH CONDUCT DISORDER)"; AVOIDPDX2="AVOIDANT PERSONALITY DISORDER (LIFETIME DIAGNOSIS)"; DEPPDDX2="DEPENDENT PERSONALITY DISORDER (LIFETIME DIAGNOSIS)";
S2AQ4B="HOW OFTEN DRANK COOLERS IN LAST 12 MONTHS"; S2AQ5B="HOW OFTEN DRANK BEER IN LAST 12 MONTHS"; S2AQ6B="HOW OFTEN DRANK WINE IN LAST 12 MONTHS"; S2AQ7B="HOW OFTEN DRANK LIQUOR IN LAST 12 MONTHS"; S2AQ10="HOW OFTEN DRANK ENOUGH TO FEEL INTOXICATED IN LAST 12 MONTHS"; S2AQ12A="HOW OFTEN DRANK BEFORE 3 PM IN LAST 12 MONTHS"; S2AQ12B="HOW OFTEN DRANK AFTER MIDNIGHT IN LAST 12 MONTHS"; S2AQ12C="HOW OFTEN DRANK AT HOME ALONE IN LAST 12 MONTHS"; S2AQ12D="HOW OFTEN DRANK IN PUBLIC PLACES IN LAST 12 MONTHS"

SAS Program Code

[Images: DecisionTree_SAScode_1, DecisionTree_SAScode_2]

SAS Program Output

[Images: RandomForest_Out_mk1, RandomForest_Out_mk2, RandomForest_Out_mk3, RandomForest_Out_mk4, RandomForest_Out_mk5]