Abstract

The aim of this project is to discover and identify the relationships between alcohol consumption and other physical and social variables of the individuals. For achieving this we explored the dataset of the National Health survey of United States.

The main challenge we had to face was to select from over 2000 variables available in the dataset only those that were useful for our research. Apart from that, we had to transform the encountered variables and subsistute missing values from the response variable.

We found out that there is a weak but significant correlation between the family poverty index and alcohol consumption. Additionally, alcohol consumption depends on the gender and marital status. Males tend to consume more alcohol than females and those who never married tend to drink more than those separated. On top of that, we found other relations to alcohol consumption with drug use and the education of the individual.

Introduction

Is alcohol consumption related to factors such as mental health, marital status, income, gender and age?

We did not have a very specific question in mind as we were more curious to analyze the factors related to alcohol consumption in general and find significant or unexpected insights. Some of our initial questions lead us to dead-ends after analyzing the data more in depth, this is why we decided to focus on a broader research, centered around alcohol consumption and with a focus on mental health and personal situations.

One of the initial purposes of developing this project was to find out how mental health relates to alcohol consumption and this has been quite challenging. Establishing this correlation could potentially raise awareness on how alcohol consumption might affect the mental health or that certain untreated mental issues could lead to higher amount consumptions. We were also interested in knowing how alcohol consumption relates to income. For instance, if people of lower income or higher income tend to consume more alcohol. This could potentially raise awareness of vicious cycles in our society. This study could also help verify or deny common beliefs or expectations. For instance, do males consume more alcohol than females? Do married people consume less alcohol than the single and separated/widowed ones? All these could show results that would futher lead researches to continue with more in depth analysis such as why certain relations exist.

Exploratory Data Analysis

Data Source: National health and nutrition examination survey

This data was obtained by the United States center of health statistics. It is an annual survey done for the civilian resident population of the United States. This is the public dataset corresponding to the period 2013-2014, and it was saved in the following folder: /data/national-health-survey/

The data collection of this survey was a complex task that involved several parts, starting with body measurements to questionnaires to evaluate the mental health. A further description of the methodology can be found here.

Data description

The dataset is divided in several sections: demographic, examinations, dietary data, laboratory, medications and questionnaires. For our study, we are going to focus on the questionnaire section, taking some columns from demographic (such as Gender, Age or Marital status) and from Laboratory (HasHIV). The variables in the questionary dataset refer to a response for a certain question, such as: In the past 12 months, how often did you drink any type of alcoholic beverage? The selected datasets are composed of 10175 rows and 1422 columns. For instance, 15 of those columns refer to answers for questions related with alcohol consumption and 22 are related with mental health.

In order to select the interesting variables for our research question, we reserach over the 1422 variables in the 3 selected datasets for questions that could be related with alcohol consumption. After doing so, we had to rename the variable codes to a readable description of the column, in order to be able to work in a more usable way with them. Not only we renamed the variables but the possible values that those variables could have. For instance, the possible values for indicating the marital status of an individual were codes from 1 to 6, where 1 meant Married, 2 was Widowed and so on. A huge advantage in the process is the great documentation of the dataset, given a variable code, it was well explained in the National Center for Health Statistics website.

We considered 44 variables from the total 1422, which are described in a more detailed manner at the data folder. Some of the features we considered during our analysis were the following:

As we can see from the presented features, there are four variables related with the demographics of the individual (HighestEducationLevel, MaritalStatus, Gender, Age, MonthlyFamilyIncome).

Also, there are some of the variables we chose which represent the mental health of the individual: DifficultyConcentrating, ThoughtSuicideLast2w,

And finally, we have three variables selected for alcohol consumption. One is AlcoholDrink5Last30d, which we could not use as there were missing many of the values, and AlcoholAmountAvg and AlcoholAmountUnit. In order to have the AlcoholAmountAvg (average of number of alcoholic drinks) in the same unit, we had to normalize the variable and transform the values to store the AlcoholAmountAvg per month. In order to do so, if the unit was week, we multiplied the value by 4, otherwise, if it was Year, it would be divided by 12. After that, we are going to select that new variable AlcoholAmountAvgPerMonth as our response variable.

Data Preprocess

Data Transformation

All the data transformation and preprocessing can be found in the preprocessing file.

First of all we joined the selected datasets by the SEQN number, which corresponds to the individual identification number. After that, we renamed the variables codes into a more readable name, as we explained in the previous section.

In the dataset, in some variables the repetition of 7’s (7, 77, 777, 7777) or 9’s (9, 99, 999, 9999) meant that the respondant refused to answer, so we had to replace them by NA, as they are not interesting in our research. Besides, we replaced the possible values of the selected categorical variables to readable factors.

New Data Variables

After doing an analysis of some of the selected variables, we added new columns to analyze if they could be useful. The added variables are:

  • AlcoholAmountAvgPerMonth: As we explained in the previous section, we merged AlcoholAmountAvg and AlcoholAmountUnit variables in a smart way to come with a new variable that would represent the AlcoholAmountAvg of the individuals per month.
  • AlcoholAmountAvgPerMonthCubeRoot: After analyzing the skewness of our response variable AlcoholAmountAvgPerMonth, we found that it had a high right skewness of approxmately 2.0. We decided to transform the variable using cuberoot transformation, which resulted in a much better distribution of the data, with a skewness of 0.38, almost being a normal distribution. However, we have to take into account that the possible results of any model using this variable would be the cube root of the real one.
# Skewness for AlcoholAmountAvgPerMonth
skewness(transformed_dataset$AlcoholAmountAvgPerMonth)
## [1] 2.056278
# Skewness for cuberoot of AlcoholAmountAvgPerMonth
skewness(kader:::cuberoot(transformed_dataset$AlcoholAmountAvgPerMonth))
## [1] 0.3879408
ggplot(dataset, aes(AlcoholAmountAvgPerMonth)) +
  geom_histogram(fill = "brown", binwidth = 1) +
  ylab("Number of People") +
  xlab("Mothly Alcohol Consumption")

ggplot(dataset,aes(kader:::cuberoot(dataset$AlcoholAmountAvgPerMonth))) +
  geom_histogram(fill = "brown", binwidth = 0.1) +
  ylab("Number of People") +
  xlab("Mothly Alcohol Consumption")

  • HighestEducationLevelDisc: We decided to try discretizing the HighestEducationLevel variable. We splitted the 6 levels into “Basic”, “Intermediate” and “Advanced”.
  • HadSTI: We grouped all the variables that refer to STI’s and we added a new column that indicates if the individual has had an STI.
  • DrugUseLast30dSum: New variable that represents the sum of the number of days the individual has consumed any of the variables related to drugs such as MethanfitamineLast30d.
  • DrugCigsUseLast30dSum: Same as the previous variable but considering cigs as a drug.
  • DrugUseLast30d: Discretized variable of DrugUseLast30dSum, which represents if the individual used or did not use drugs in the last month, not giving importance to the exact amount.
  • DrugCigsUseLast30d: Same as the previous discretized variable but with DrugCigsUseLast30dSum.

Relationships among ariables

For exploring the relations between variables, before performing the statisticaly analysis, we displayed some plots.

First of all, we studied the relation between the average monthly alcohol consumption and the martial status of the respondants. We can see that the individuals that are married drink less than the separated or widowed ones. Despite a large group of the respondants are married, the other groups have a considerable amount of individuals, so we can say it is representative.

## Add new variable with a ponderated mean per Marital Status category
dataset_with_alcohol_mean_marital_status <-
  transformed_dataset %>% filter(MaritalStatus != "NA") %>% group_by(MaritalStatus) %>% summarise(
    AlcoholAmountAvgPerMonthMean = mean(AlcoholAmountAvgPerMonth),
    NumberOfPeopleInMaritalStatus = length(AlcoholAmountAvgPerMonth),
    AlcoholAmountAvgPerMonthPonderatedMean = mean(AlcoholAmountAvgPerMonth) / length(AlcoholAmountAvgPerMonth)
  )

## Show alcohol consumption mean per marital status group
dataset_with_alcohol_mean_marital_status %>%
  ggplot(aes(x = MaritalStatus, y = AlcoholAmountAvgPerMonthMean)) +
  geom_bar(aes(fill = MaritalStatus), stat = "identity") +
  ggtitle("Average Monthly consumption and Marital Status") +
  fancy_plot_no_legend

If we analyze the individuals that answer they have spend time in a bar in the last 7 days with the marital status, we can see that the people that have never been married go more often to a bar than the widowed people, for instance.

dataset_without_na_marital_status <-
  dataset %>% filter(MaritalStatus != "NA")

## See Marital Status and SpendTimeBar7d Per Marital Status (Stacked Bar) ALl 100%
dataset_without_na_marital_status %>%
  ggplot(aes(x = factor(1), fill = SpendTimeBar7d)) +
  geom_bar(stat = "count", position = "fill") +
  ggtitle("Spend Time In Bar in the last 7 days?") +
  facet_wrap(~MaritalStatus, switch = "both") +
  theme_minimal() +
  scale_fill_brewer(palette = "Reds") +
  fancy_plot

According to the FeelDownDepressedLast2W variable, we found an interesting fact, apaprently the individuals that have more often felt depressed in the last 2 weeks, are the ones with lower alcohol consumption per month. We represented the y axis with the cube root of the AlcoholConsumptionAvgPerMonth as the results are more normalized. To know the real alcohol consumption we will have to power the values from the y axis to 3.

dataset %>%
  filter(FeelDownDepressedLast2W != "NA") %>%
  ggplot(aes(x = FeelDownDepressedLast2W, y = AlcoholAmountAvgPerMonthCubeRoot, fill = FeelDownDepressedLast2W)) +
  geom_boxplot() +
  ggtitle("Alcohol Consumption per month cube Root - Feel Down or Depressed") +
  theme_minimal() +
  fancy_plot_no_legend

According to the family poverty index, we found a week correlation between it and our response variable, so we decided to study the relation between poverty index and time in bar. We found out that the individuals with lower poverty index have spend fewer time in a bar in the last 7 days, a logic result. As the family poverty increases, the number of respondants that have spend more time in a bar increase.

dataset %>%
  ggplot(aes(x = FamilyPovertyIndex, fill = SpendTimeBar7d)) +
  geom_density(alpha = 0.5) +
  ggtitle("Poverty Index - Time in Bar") +
  theme_minimal() +
  fancy_plot

Finally, we found an interesting relation between gender and alcohol consumption. As we can see in the next graph, males appararently consume more alcoholic drinks per month.

dataset %>%
  group_by(Gender) %>%
  summarize(AlcoholAmountAvgPerMonthMean = mean(AlcoholAmountAvgPerMonth)) %>%
  ggplot(aes(x = Gender, y = AlcoholAmountAvgPerMonthMean, fill = Gender)) +
  geom_bar(stat="identity") +
  ggtitle("Monthly Alcohol Consumption per Gender") +
  theme_minimal() +
  fancy_plot_no_legend

Methods

  library(caret)
## Loading required package: lattice
  # We can deal with the missing values of FamilyPovertyIndex by doing an average
  dataset_without_na <- dataset
  mean_fpi <- mean(dataset_without_na$FamilyPovertyIndex, na.rm = T)
  dataset_without_na$FamilyPovertyIndex[is.na(dataset_without_na$FamilyPovertyIndex)] <- mean_fpi

  # For the highest education level we would do the mean and subsitute for the corresponding factor
  mean_hel <- mean(dataset_without_na$HighestEducationLevel, na.rm = T) # It is 3.52, rounded to 4, which is Advanced
  dataset_without_na$HighestEducationLevelDisc[is.na(dataset_without_na$HighestEducationLevelDisc)] <- "Advanced"
  dataset_without_na$HighestEducationLevel[is.na(dataset_without_na$HighestEducationLevel)] <- mean_hel

  dataset_without_na <-
    DataCombine::DropNA(dataset_without_na, Var = "CountryBorn")
## 4 rows dropped from the data frame because of missing values.
  training_labels <- dataset_without_na

  # Create a set of training indices
  trainIndex <- createDataPartition(
    training_labels$AlcoholAmountAvgPerMonthCubeRoot, # Sample proportionally based on the outcome variable
    p = .8,           # Percentage to be used for training
    list = FALSE,     # Return the indices as a vector (not a list)
    times = 1         # Only create one set of indices
  )

  # Subset your data into training and testing set
  training_set <- training_labels[ trainIndex, ] # Use indices to get training data
  test_set <- training_labels[ -trainIndex, ]    # Remove train indices to get test data

  lm <- lm(AlcoholAmountAvgPerMonthCubeRoot ~ Gender + Age + CountryBorn + FamilyPovertyIndex + HighestEducationLevel + DrugCigsUseLast30dSum + SpendTimeBar7d, data = training_set)
  lm_prediction_train <- predict(lm, training_set)
  lm_prediction_test <- predict(lm, test_set)

  lm_r2_train <- R2(lm_prediction_train, training_set$AlcoholAmountAvgPerMonthCubeRoot)
  lm_r2_test <- R2(lm_prediction_test, test_set$AlcoholAmountAvgPerMonthCubeRoot)

  # R2 for LM in train
  print(lm_r2_train)
## [1] 0.1343156
  # R2 for LM in test
  print(lm_r2_test)
## [1] 0.1201753
  summary(lm)
## 
## Call:
## lm(formula = AlcoholAmountAvgPerMonthCubeRoot ~ Gender + Age + 
##     CountryBorn + FamilyPovertyIndex + HighestEducationLevel + 
##     DrugCigsUseLast30dSum + SpendTimeBar7d, data = training_set)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.0495 -0.6338 -0.1090  0.5017  2.4228 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            1.2817110  0.0749347  17.104  < 2e-16 ***
## GenderFemale          -0.2183439  0.0274714  -7.948 2.52e-15 ***
## Age                   -0.0026938  0.0007919  -3.402 0.000677 ***
## CountryBornNoUS       -0.0743835  0.0330765  -2.249 0.024584 *  
## FamilyPovertyIndex     0.0854851  0.0101925   8.387  < 2e-16 ***
## HighestEducationLevel  0.0933859  0.0134814   6.927 5.08e-12 ***
## DrugCigsUseLast30dSum  0.0056345  0.0010101   5.578 2.61e-08 ***
## SpendTimeBar7dNo      -0.5171974  0.0419637 -12.325  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8162 on 3570 degrees of freedom
## Multiple R-squared:  0.1343, Adjusted R-squared:  0.1326 
## F-statistic: 79.13 on 7 and 3570 DF,  p-value: < 2.2e-16

As we can see in the summary of the model training, all the variables are significant in the model, ussing a significance level of 95%. This happens because all the p-values are less than 0.05. Also, we can see that the linear model is significant because we have a p-value of 2.2e-16, using a F-statistic.

But, we have a problem with the r-squared, which have a value of 0.15 approximately. In our field, having this value could be considered enough, because we have more than 1244 columns in our original dataset. However, we will try to improve it using a different model.

We used the generalized linear model in order to improve the results of the basic linear model.

  glm <- glm(AlcoholAmountAvgPerMonthCubeRoot ~ Gender + Age + CountryBorn + FamilyPovertyIndex + HighestEducationLevel + DrugCigsUseLast30dSum + SpendTimeBar7d, data = training_set)
  glm_prediction_train <- predict(glm, training_set)
  glm_prediction_test <- predict(glm, test_set)

  glm_r2_train <- R2(glm_prediction_train, training_set$AlcoholAmountAvgPerMonthCubeRoot)
  glm_r2_test <- R2(glm_prediction_test, test_set$AlcoholAmountAvgPerMonthCubeRoot)

  # R2 for GLM in train
  print(glm_r2_train)
## [1] 0.1343156
  # R2 for GLM in test
  print(glm_r2_test)
## [1] 0.1201753
  summary(glm)
## 
## Call:
## glm(formula = AlcoholAmountAvgPerMonthCubeRoot ~ Gender + Age + 
##     CountryBorn + FamilyPovertyIndex + HighestEducationLevel + 
##     DrugCigsUseLast30dSum + SpendTimeBar7d, data = training_set)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.0495  -0.6338  -0.1090   0.5017   2.4228  
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            1.2817110  0.0749347  17.104  < 2e-16 ***
## GenderFemale          -0.2183439  0.0274714  -7.948 2.52e-15 ***
## Age                   -0.0026938  0.0007919  -3.402 0.000677 ***
## CountryBornNoUS       -0.0743835  0.0330765  -2.249 0.024584 *  
## FamilyPovertyIndex     0.0854851  0.0101925   8.387  < 2e-16 ***
## HighestEducationLevel  0.0933859  0.0134814   6.927 5.08e-12 ***
## DrugCigsUseLast30dSum  0.0056345  0.0010101   5.578 2.61e-08 ***
## SpendTimeBar7dNo      -0.5171974  0.0419637 -12.325  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.666202)
## 
##     Null deviance: 2747.4  on 3577  degrees of freedom
## Residual deviance: 2378.3  on 3570  degrees of freedom
## AIC: 8710.7
## 
## Number of Fisher Scoring iterations: 2

As it happened with the previous model, all the variables are significative according to their p-value, an the r2 is very similar to the previous model. Here, the metric that we are going to use to measure the performance is the AIC (Akaike Information Criterion), which have a value of 8695.6

  set.seed(100)

  cvControl <- caret::trainControl(method = "cv", number = 10)
  dtGrid <- expand.grid(cp = (0:10) * 0.01)
  
  fit_dt_grid <- caret::train(
    AlcoholAmountAvgPerMonthCubeRoot ~ Gender + Age + CountryBorn + FamilyPovertyIndex + HighestEducationLevel + DrugCigsUseLast30dSum + SpendTimeBar7d,
    data = dataset_without_na,
    method = "rpart",
    trControl = cvControl,
    tuneGrid = dtGrid
  )
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info =
## trainInfo, : There were missing values in resampled performance measures.
  fit_dt_grid
## CART 
## 
## 4471 samples
##    7 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 4024, 4024, 4025, 4024, 4023, 4025, ... 
## Resampling results across tuning parameters:
## 
##   cp    RMSE       Rsquared    MAE      
##   0.00  0.8992256  0.06799558  0.7140790
##   0.01  0.8375911  0.08978296  0.6820555
##   0.02  0.8396859  0.08399567  0.6834234
##   0.03  0.8499190  0.06274523  0.6978806
##   0.04  0.8499190  0.06274523  0.6978806
##   0.05  0.8499190  0.06274523  0.6978806
##   0.06  0.8580252  0.05453903  0.7069545
##   0.07  0.8767124         NaN  0.7332489
##   0.08  0.8767124         NaN  0.7332489
##   0.09  0.8767124         NaN  0.7332489
##   0.10  0.8767124         NaN  0.7332489
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was cp = 0.01.

With this approach, we obtained a much lower r-squared of around 0.09.

We can conclude the better model is linear regression, despite its r-squared is low 0.15, it is superior than the others.

Discussion and Future Work

Based on specific observations from the results section, the report clearly provides:

We were expecting to have more surprising and intriguing results, but unfortunately we were many times disappointed. Although we have chosen a huge dataset, it still felt like there was not enough data. Nevertheless, our methods had some clear results such as that alcohol consumption depends on the gender and marital status. Unfortuantely, we did not have the same sample size for all the marital status groups so this could be something to be improved for the future - a better dataset to check the dependence.

The gender dependence was something we expected. In a new project this could futher be studied in relation to other variables, such as the amount of swear words used by each gender, crime activity and other factors that come in a full suite of cultural behaviour.

We also conclude that it is quite difficult to predict alcohol drinking habits as it is not strongly correlated to the factors about which we had information in the dataset.

It is also important to note that the data was gathered from questionnaires, therefore its accuracy is not granted. For future, it would also be interesting to work with datasets that have a different source such as purchase statements.

Finally, as we improve our data processing skills and come across better data and in greater amounts, it would be ideal if we could use artificial neural networks to find patterns regarding alcohol consumption and mental health, as there migt be some tings that we are unable to detect using other methods. The results could potentially show the risks involved with alcohol consumption.