Machine Learning – Lasso Regression

This page shows the progress of my research project for the Machine Learning for Data Analysis course by Wesleyan University on Coursera (Week 3 Assignment: Running a Lasso Regression Analysis).

Summary of the Lasso Regression Analysis Model

A LASSO (least absolute shrinkage and selection operator) regression analysis with k-fold cross validation was performed to identify the subset of predictors, from a larger pool of candidate variables, that best predicts a quantitative response variable, S2BQ3A (AGE AT ONSET OF ALCOHOL ABUSE). The data come from Wave 1 (2001–2002) of the U.S. National Epidemiologic Survey on Alcohol and Related Conditions (NESARC) (n = 43,093).
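For reference, the "shrinkage and selection" in the name can be read off the usual textbook form of the lasso objective. The notation here is an assumption of this note, not something taken from the SAS output: y_i is the response (S2BQ3A) for observation i, x_ij is the value of predictor j, n is the number of training observations, p is the number of candidate predictors, and lambda is the penalty weight, which is chosen by cross validation. Scaling conventions for the first term vary between textbooks and software.

```latex
\hat{\beta}
  = \arg\min_{\beta_0,\,\beta}
    \; \frac{1}{2n} \sum_{i=1}^{n}
       \Bigl( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\,\beta_j \Bigr)^{2}
    \; + \; \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert ,
  \qquad \lambda \ge 0
```

The absolute-value penalty shrinks coefficients toward zero and, for a large enough lambda, sets some of them exactly to zero. That is why the method returns a subset of the predictors rather than a coefficient for every candidate.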

The following explanatory variables were included as candidate predictors in the lasso regression model of AGE AT ONSET OF ALCOHOL ABUSE: age, gender, race/ethnicity, household income, marital status, education level, occupation and other job-related conditions, relationship problems, financial hardship, region and type of housing, age at first drinking alcohol, drinking pattern and frequency, smoking, drug use, gambling, mental health problems, and family history of addiction or mental health issues. [*The complete list is shown below]

First, the data were split randomly into training and test sets (70% and 30%, respectively) by simple random sampling [using the SAS procedure ‘proc surveyselect’ with the samprate=0.7 and method=srs options]. Then, a lasso regression analysis was performed using the least angle regression (LAR) model selection algorithm with 10-fold cross validation (k = 10). The selected model for the target variable, AGE AT ONSET OF ALCOHOL ABUSE, retains only 22 of the 58 explanatory variables offered to the procedure. Please see the SAS Program Output section below for the detailed output. [Note: The model selected by the LAR algorithm is not necessarily a good predictor of AGE AT ONSET OF ALCOHOL ABUSE: it explains only about 23% of the variance in the response (Adj R-Sq = 0.226).]
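The two steps described above might look roughly like the following in SAS. This is a minimal sketch, not the program used for this analysis. The data set names nesarc and traintest, the seed value 123, and the shortened predictor list are assumptions for illustration; the real model statement would list all of the explanatory variables. The sketch also treats every predictor as numeric, so categorical variables would first need to be recoded or declared in a class statement. selection=lar follows the description in the text; proc glmselect also accepts selection=lasso.

```sas
/* Hypothetical data set names: nesarc (input), traintest (output).   */
/* The seed is arbitrary; it only makes the random split repeatable.  */
ods graphics on;

/* Step 1: simple random sample of 70% of the observations.           */
/* OUTALL keeps every row and adds a 0/1 indicator named Selected.    */
proc surveyselect data=nesarc out=traintest seed=123
                  samprate=0.7 method=srs outall;
run;

/* Step 2: LAR-based selection with 10-fold cross validation.         */
/* Only a few predictors are shown here to keep the example short.    */
proc glmselect data=traintest plots=all seed=123;
  /* Rows with Selected=1 train the model; Selected=0 rows test it.   */
  partition role=selected(train='1' test='0');
  model S2BQ3A = SEX AGE S2AQ16A S2AQ19 S1Q6A S1Q12B
        / selection=lar(choose=cv stop=none) cvmethod=random(10);
run;
```

Here choose=cv asks the procedure to pick the step with the best cross-validated error, stop=none lets the selection path run to the end before that choice is made, and cvmethod=random(10) assigns training observations at random to 10 folds.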

* The full list of the 58 explanatory variables used in the model is as follows:

SEX="GENDER"; AGE="AGE"; S2AQ16A="AGE WHEN STARTED DRINKING"; S2AQ19="AGE AT START OF PERIOD OF HEAVIEST DRINKING";
CENDIV="CENSUS DIVISION"; BUILDTYP="TYPE OF BUILDING FOR HOUSEHOLD";
S1Q1D5="WHITE CHECKED IN MULTIRACE CODE"; S1Q1C="HISPANIC OR LATINO"; S1Q1D1="AMERICAN INDIAN OR ALASKA NATIVE"; S1Q1D2="ASIAN"; S1Q1D3="BLACK OR AFRICAN AMERICAN"; S1Q1D4="NATIVE HAWAIIAN OR OTHER PACIFIC ISLANDER";
S1Q2A="LIVED WITH AT LEAST 1 BIOLOGICAL PARENT BEFORE AGE 18"; S1Q2B="BIOLOGICAL FATHER EVER LIVE IN HOUSEHOLD BEFORE RESPONDENT WAS 18";
SPOUSE="SPOUSE OF RESPONDENT IN HOUSEHOLD";
S1Q4A="AGE AT FIRST MARRIAGE"; S1Q4B="HOW FIRST MARRIAGE ENDED";
S1Q6A="HIGHEST GRADE OR YEAR OF SCHOOL COMPLETED";
S1Q7A1="WORKING FULL TIME (35+ HOURS A WEEK)"; S1Q7A2="WORKING PART TIME (<35 HOURS A WEEK)"; S1Q7A3="EMPLOYED BUT NOT AT WORK BECAUSE OF TEMPORARY ILLNESS OR INJURY";
S1Q9A="BUSINESS OR INDUSTRY"; S1Q9B="OCCUPATION"; S1Q9C="TYPE OF EMPLOYER";
S1Q12B="TOTAL HOUSEHOLD INCOME IN LAST 12 MONTHS"; S1Q14A="PERSONALLY RECEIVED FOOD STAMPS IN LAST 12 MONTHS";
S1Q16="SELF-PERCEIVED CURRENT HEALTH";
S1Q232="ANY FAMILY MEMBERS OR CLOSE FRIENDS HAD SERIOUS ILLNESS OR INJURY IN LAST 12 MONTHS"; S1Q233="MOVED/ANYONE NEW CAME TO LIVE WITH YOU IN LAST 12 MONTHS";
S1Q234="FIRED OR LAID OFF FROM JOB IN LAST 12 MONTHS"; S1Q236="HAD TROUBLE WITH BOSS OR COWORKER IN LAST 12 MONTHS"; S1Q237="CHANGED JOBS, JOB RESPONSIBILITIES OR WORK HOURS IN LAST 12 MONTHS";
S1Q238="GOT SEPARATED OR DIVORCED OR BROKE OFF STEADY RELATIONSHIP IN LAST 12 MONTHS"; S1Q239="HAD PROBLEMS WITH NEIGHBOR, FRIEND OR RELATIVE IN LAST 12 MONTHS";
S1Q2310="EXPERIENCED MAJOR FINANCIAL CRISIS OR BANKRUPTCY IN LAST 12 MONTHS";
S1Q2312="YOU OR FAMILY MEMBER BEEN VICTIM OF CRIME IN LAST 12 MONTHS";
S2DQ1="BLOOD/NATURAL FATHER EVER AN ALCOHOLIC OR PROBLEM DRINKER"; S2DQ2="BLOOD/NATURAL MOTHER EVER AN ALCOHOLIC OR PROBLEM DRINKER"; S2DQ3C2="ANY FULL BROTHERS EVER ALCOHOLICS OR PROBLEM DRINKERS"; S2DQ4C2="ANY FULL SISTERS EVER ALCOHOLICS OR PROBLEM DRINKERS";
SMOKER="TOBACCO USE STATUS"; S3AQ3C1="USUAL QUANTITY WHEN SMOKED CIGARETTES"; TAB12MDX="NICOTINE DEPENDENCE IN THE LAST 12 MONTHS"; TABP12MDX="NICOTINE DEPENDENCE PRIOR TO THE LAST 12 MONTHS";
S3BQ1A6="EVER USED COCAINE OR CRACK"; DGSTATUS="DRUG USE STATUS";
DGENAXDXSNI12="GENERALIZED ANXIETY IN LAST 12 MONTHS"; DGENAXDXSNIP12="GENERALIZED ANXIETY PRIOR TO THE LAST 12 MONTHS";
GAMB12DX="PATHOLOGICAL GAMBLING IN LAST 12 MONTHS"; GAMBP12DX="PATHOLOGICAL GAMBLING PRIOR TO THE LAST 12 MONTHS";
ANTISOCDX2="ANTISOCIAL PERSONALITY DISORDER (WITH CONDUCT DISORDER)"; AVOIDPDX2="AVOIDANT PERSONALITY DISORDER (LIFETIME DIAGNOSIS)"; DEPPDDX2="DEPENDENT PERSONALITY DISORDER (LIFETIME DIAGNOSIS)";
S2AQ5B="HOW OFTEN DRANK BEER IN LAST 12 MONTHS"; S2AQ6B="HOW OFTEN DRANK WINE IN LAST 12 MONTHS"; S2AQ7B="HOW OFTEN DRANK LIQUOR IN LAST 12 MONTHS"; S2AQ10="HOW OFTEN DRANK ENOUGH TO FEEL INTOXICATED IN LAST 12 MONTHS"; S2AQ12A="HOW OFTEN DRANK BEFORE 3 PM IN LAST 12 MONTHS"; S2AQ12B="HOW OFTEN DRANK AFTER MIDNIGHT IN LAST 12 MONTHS"; S2AQ12C="HOW OFTEN DRANK AT HOME ALONE IN LAST 12 MONTHS"; S2AQ12D="HOW OFTEN DRANK IN PUBLIC PLACES IN LAST 12 MONTHS";

SAS Program Code

[Images of the SAS program code: LASSO_SAScode_mk1, LASSO_SAScode_mk2, LASSO_SAScode_mk3]

SAS Program Output

[Images of the SAS program output: LASSO_SAS_out_mk1, LASSO_SAS_out_mk2, LASSO_SAS_out_mk3, LASSO_SAS_out_mk4, Lasso_out_pic1, Lasso_out_pic2, Lasso_out_pic3, LASSO_SAS_out_mk5, LASSO_SAS_out_mk6]