Load packages and data

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.2     ✓ purrr   0.3.4
## ✓ tibble  3.0.3     ✓ dplyr   1.0.2
## ✓ tidyr   1.1.1     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.5.0
## ── Conflicts ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
df <- readr::read_rds("df.rds")
step_1_df <- readr::read_rds("step_1_df.rds")
step_2_a_df <- readr::read_rds("step_2_a_df.rds")
step_2_b_df <- readr::read_rds("step_2_b_df.rds")

Overview

  1. Exploration
  2. Regression models
  3. Binary classification option a
  4. Binary classification option b
  5. Interpretation and optimization

This investigation is divided into 5 main parts, of which we have completed the first: exploratory data analysis. Although our ultimate goal is to fit more comprehensive Bayesian models to the data, we will first get a feel for the behavior of response_1 as a function of the step 1 inputs, using lm().

Therefore, part ii is split into three sub-parts:

iiA. Regression models - lm()
iiB. Regression models - Bayesian
iiC. Regression models - Models with Resampling

This R Markdown file tackles part iiA, specifically using lm() to fit several linear models to the dataset.

Regression models - lm()

Fitting response_1 to discrete inputs

Let’s fit a linear model for response_1, using a linear relationship with the discrete inputs, xA and xB, as additive terms. Because xA and xB are factors, lm() expands each into indicator (dummy) variables for every non-reference level, so the model being estimated is:

\[ \mathrm{response\_1} = b_0 + b_1\,\mathrm{xA}_{A2} + b_2\,\mathrm{xB}_{B2} + b_3\,\mathrm{xB}_{B3} + b_4\,\mathrm{xB}_{B4} \]

With lm() the linear model is easy to fit using the formula interface:

mod_lm_discrete <- lm(response_1 ~ xA + xB, step_1_df)
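
To see how lm() encodes these factors, we can inspect the design matrix it builds internally; a quick sketch (head() just shows the first rows, nothing new is fit):

# Design matrix: intercept plus indicator columns xAA2, xBB2, xBB3, xBB4
model.matrix(response_1 ~ xA + xB, data = step_1_df) %>% head()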

Summarize the model results with the summary() function.

mod_lm_discrete %>% summary()
## 
## Call:
## lm(formula = response_1 ~ xA + xB, data = step_1_df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -26.2049  -2.6076  -0.2137   2.7004  26.1931 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   3.3763     0.2426  13.915  < 2e-16 ***
## xAA2          0.3426     0.2051   1.670    0.095 .  
## xBB2         -2.4703     0.2826  -8.743  < 2e-16 ***
## xBB3         -7.7089     0.2881 -26.759  < 2e-16 ***
## xBB4         -1.4109     0.3046  -4.632 3.85e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.543 on 2008 degrees of freedom
## Multiple R-squared:  0.2954, Adjusted R-squared:  0.294 
## F-statistic: 210.5 on 4 and 2008 DF,  p-value: < 2.2e-16

From the output above, the p-value of the F-statistic is < 2.2e-16, which is highly significant: at least one of the discrete inputs is very likely related to response_1.

From the coefficients table, xAA2 has the smallest absolute t-value and the smallest coefficient estimate, and correspondingly the largest t-test p-value (0.095). This suggests that xA is the least significant of the discrete inputs.
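
To rank the terms programmatically, the coefficients table can be pulled into a tibble and sorted by p-value; a minimal sketch using broom (installed with the tidyverse, though not attached by library(tidyverse)):

# Coefficient table as a tibble, ordered from most to least significant
broom::tidy(mod_lm_discrete) %>% arrange(p.value)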

Compute and store the confidence intervals of the model coefficients:

mod_lm_discrete_confint <- mod_lm_discrete %>% confint()
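
If only the xB terms are of interest, confint() can be restricted to them via its parm argument; for example:

# Confidence intervals for the xB indicator coefficients only
confint(mod_lm_discrete, parm = c("xBB2", "xBB3", "xBB4"))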

Fitting response_1 to continuous inputs

Next, we fit a linear model for response_1 using a linear relationship with the continuous step 1 inputs, x01 to x06:

mod_lm_cont <- lm(response_1 ~ . - xA - xB, step_1_df)
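
The formula shorthand . - xA - xB means "all remaining columns except xA and xB". For reference, an equivalent fit with the inputs spelled out (mod_lm_cont_explicit is just an illustrative name):

# Equivalent model with the continuous inputs listed explicitly
mod_lm_cont_explicit <- lm(response_1 ~ x01 + x02 + x03 + x04 + x05 + x06,
                           data = step_1_df)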

Summarize the model results with the summary() function.

mod_lm_cont %>% summary()
## 
## Call:
## lm(formula = response_1 ~ . - xA - xB, data = step_1_df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -25.9007  -2.1221   0.0251   2.2287  22.5315 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 37.14788   32.36197   1.148   0.2512    
## x01          1.01973    0.06305  16.174  < 2e-16 ***
## x02         -3.36156    0.22277 -15.090  < 2e-16 ***
## x03          0.15331    0.03640   4.212 2.64e-05 ***
## x04          0.07727    0.03852   2.006   0.0450 *  
## x05          0.04013    0.01610   2.493   0.0127 *  
## x06          0.32723    0.29464   1.111   0.2669    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.842 on 2006 degrees of freedom
## Multiple R-squared:  0.2003, Adjusted R-squared:  0.1979 
## F-statistic: 83.75 on 6 and 2006 DF,  p-value: < 2.2e-16

From the above results, x01, x02, and x03 are the most significant continuous inputs: they have the smallest p-values and the largest absolute t-statistics, as shown by the *** significance codes on their rows of the coefficients table.

Compute the confidence intervals of the continuous coefficients. Note that the intervals for the intercept and x06 include 0, consistent with their insignificant t-tests.

mod_lm_cont %>% confint()
##                     2.5 %       97.5 %
## (Intercept) -26.318713740 100.61446692
## x01           0.896079310   1.14337588
## x02          -3.798443295  -2.92468149
## x03           0.081931903   0.22469779
## x04           0.001730871   0.15280775
## x05           0.008562060   0.07169289
## x06          -0.250608248   0.90505900

Fitting response_1 to all step 1 inputs

Now, we fit a linear model for response_1 using a linear relationship with all step 1 inputs, both discrete (xA and xB) and continuous (x01 to x06), as additive terms:

mod_lm_step1 <- lm(response_1 ~ ., step_1_df)

Summarize the model results with the summary() function.

mod_lm_step1 %>% summary()
## 
## Call:
## lm(formula = response_1 ~ ., data = step_1_df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -21.8852  -2.1475  -0.1666   2.0917  19.4622 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 180.491682  27.664490   6.524 8.63e-11 ***
## xAA2          0.432349   0.173763   2.488   0.0129 *  
## xBB2         -2.407999   0.253956  -9.482  < 2e-16 ***
## xBB3         -7.868357   0.244465 -32.186  < 2e-16 ***
## xBB4         -1.273069   0.274611  -4.636 3.78e-06 ***
## x01           0.861475   0.050990  16.895  < 2e-16 ***
## x02          -4.136833   0.181353 -22.811  < 2e-16 ***
## x03           0.064684   0.029526   2.191   0.0286 *  
## x04           0.037112   0.031071   1.194   0.2324    
## x05          -0.003172   0.013071  -0.243   0.8083    
## x06          -0.157523   0.239080  -0.659   0.5101    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.843 on 2002 degrees of freedom
## Multiple R-squared:  0.4973, Adjusted R-squared:  0.4948 
## F-statistic:   198 on 10 and 2002 DF,  p-value: < 2.2e-16
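
Combining both input groups raises the adjusted R-squared to 0.4948, well above the discrete-only (0.294) and continuous-only (0.198) fits. Since both smaller models are nested within this one, anova() can formalize the comparison with F-tests; a quick sketch:

# F-tests: do the added terms significantly improve each nested fit?
anova(mod_lm_discrete, mod_lm_step1)
anova(mod_lm_cont, mod_lm_step1)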

Fitting response_1 to a basis function

Finally, let’s define a basis function of the inputs and fit a linear model to it with lm(). Taking the most significant inputs from the all-inputs fit, we apply natural spline functions with df = 2 to the continuous inputs x01, x02, and x03, and interact these splines with the discrete input xB, keeping xA as an additive term:

mod_lm_basis <- lm(response_1 ~ xA + xB*(splines::ns(x01, df = 2) + splines::ns(x02, df = 2) + splines::ns(x03, df = 2)), step_1_df)

Summarize the model results with the summary() function.

mod_lm_basis %>% summary()
## 
## Call:
## lm(formula = response_1 ~ xA + xB * (splines::ns(x01, df = 2) + 
##     splines::ns(x02, df = 2) + splines::ns(x03, df = 2)), data = step_1_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.1094 -1.1645  0.0061  1.1703  6.0096 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     44.06430    0.93666  47.044  < 2e-16 ***
## xAA2                             0.32502    0.08148   3.989 6.88e-05 ***
## xBB2                           -20.77093    1.45378 -14.288  < 2e-16 ***
## xBB3                           -96.08513    1.29596 -74.142  < 2e-16 ***
## xBB4                           -21.36615    1.57853 -13.535  < 2e-16 ***
## splines::ns(x01, df = 2)1      -25.01097    1.03390 -24.191  < 2e-16 ***
## splines::ns(x01, df = 2)2       15.46004    0.55859  27.677  < 2e-16 ***
## splines::ns(x02, df = 2)1      -24.71942    0.92389 -26.756  < 2e-16 ***
## splines::ns(x02, df = 2)2       -4.96759    0.58559  -8.483  < 2e-16 ***
## splines::ns(x03, df = 2)1      -23.92220    1.12445 -21.275  < 2e-16 ***
## splines::ns(x03, df = 2)2        7.48919    0.75993   9.855  < 2e-16 ***
## xBB2:splines::ns(x01, df = 2)1  25.21496    1.64141  15.362  < 2e-16 ***
## xBB3:splines::ns(x01, df = 2)1  68.74650    1.35603  50.697  < 2e-16 ***
## xBB4:splines::ns(x01, df = 2)1   7.31292    1.72922   4.229 2.45e-05 ***
## xBB2:splines::ns(x01, df = 2)2  -9.31338    0.68195 -13.657  < 2e-16 ***
## xBB3:splines::ns(x01, df = 2)2 -20.17660    0.86424 -23.346  < 2e-16 ***
## xBB4:splines::ns(x01, df = 2)2  -3.28783    0.71287  -4.612 4.24e-06 ***
## xBB2:splines::ns(x02, df = 2)1   0.50466    1.43248   0.352  0.72465    
## xBB3:splines::ns(x02, df = 2)1  24.25162    1.22995  19.718  < 2e-16 ***
## xBB4:splines::ns(x02, df = 2)1  22.15519    1.47612  15.009  < 2e-16 ***
## xBB2:splines::ns(x02, df = 2)2  -0.70131    0.73326  -0.956  0.33897    
## xBB3:splines::ns(x02, df = 2)2  -8.78045    0.89353  -9.827  < 2e-16 ***
## xBB4:splines::ns(x02, df = 2)2  -9.63658    0.75418 -12.778  < 2e-16 ***
## xBB2:splines::ns(x03, df = 2)1   6.86556    1.69934   4.040 5.55e-05 ***
## xBB3:splines::ns(x03, df = 2)1  62.25154    1.54186  40.374  < 2e-16 ***
## xBB4:splines::ns(x03, df = 2)1   6.23329    1.94525   3.204  0.00138 ** 
## xBB2:splines::ns(x03, df = 2)2  -2.67150    0.88862  -3.006  0.00268 ** 
## xBB3:splines::ns(x03, df = 2)2 -22.96057    1.00569 -22.831  < 2e-16 ***
## xBB4:splines::ns(x03, df = 2)2  -0.89516    0.93233  -0.960  0.33711    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.796 on 1984 degrees of freedom
## Multiple R-squared:  0.8911, Adjusted R-squared:  0.8896 
## F-statistic: 579.9 on 28 and 1984 DF,  p-value: < 2.2e-16

Comparing lm() models

The linear model fit using the basis function is the best of the four: it has the highest adjusted R-squared (0.8896, closest to 1 of all the models) and the lowest residual standard error (1.796).

The second best model is the linear model fit using all step 1 inputs (adjusted R-squared 0.4948); see the tabulated comparison below.
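
As a quick check on this ranking, the fit metrics can be tabulated across all four models; a minimal sketch pulling adjusted R-squared and the residual standard error out of each summary():

# Tabulate adjusted R-squared and residual standard error per model
list(
  discrete = mod_lm_discrete,
  continuous = mod_lm_cont,
  all_step1 = mod_lm_step1,
  basis = mod_lm_basis
) %>%
  purrr::map_dfr(
    ~ tibble::tibble(
      adj_r_squared = summary(.x)$adj.r.squared,
      sigma = summary(.x)$sigma
    ),
    .id = "model"
  )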

Visualization

Load the coefplot package:

library(coefplot)

Compute confidence intervals for best and second best models:

mod_lm_basis_confint <- mod_lm_basis %>% confint()
mod_lm_basis_confint
##                                      2.5 %      97.5 %
## (Intercept)                     42.2273462  45.9012456
## xAA2                             0.1652203   0.4848259
## xBB2                           -23.6220220 -17.9198418
## xBB3                           -98.6267100 -93.5435468
## xBB4                           -24.4619019 -18.2703967
## splines::ns(x01, df = 2)1      -27.0386074 -22.9833389
## splines::ns(x01, df = 2)2       14.3645420  16.5555293
## splines::ns(x02, df = 2)1      -26.5313185 -22.9075272
## splines::ns(x02, df = 2)2       -6.1160268  -3.8191522
## splines::ns(x03, df = 2)1      -26.1274276 -21.7169800
## splines::ns(x03, df = 2)2        5.9988492   8.9795390
## xBB2:splines::ns(x01, df = 2)1  21.9958858  28.4340253
## xBB3:splines::ns(x01, df = 2)1  66.0871028  71.4059006
## xBB4:splines::ns(x01, df = 2)1   3.9216464  10.7041935
## xBB2:splines::ns(x01, df = 2)2 -10.6508058  -7.9759638
## xBB3:splines::ns(x01, df = 2)2 -21.8715080 -18.4816822
## xBB4:splines::ns(x01, df = 2)2  -4.6858750  -1.8897790
## xBB2:splines::ns(x02, df = 2)1  -2.3046653   3.3139781
## xBB3:splines::ns(x02, df = 2)1  21.8395034  26.6637438
## xBB4:splines::ns(x02, df = 2)1  19.2602715  25.0501064
## xBB2:splines::ns(x02, df = 2)2  -2.1393394   0.7367230
## xBB3:splines::ns(x02, df = 2)2 -10.5328027  -7.0281022
## xBB4:splines::ns(x02, df = 2)2 -11.1156458  -8.1575067
## xBB2:splines::ns(x03, df = 2)1   3.5328891  10.1982350
## xBB3:splines::ns(x03, df = 2)1  59.2277077  65.2753713
## xBB4:splines::ns(x03, df = 2)1   2.4183397  10.0482474
## xBB2:splines::ns(x03, df = 2)2  -4.4142243  -0.9287775
## xBB3:splines::ns(x03, df = 2)2 -24.9328969 -20.9882481
## xBB4:splines::ns(x03, df = 2)2  -2.7236170   0.9332961
mod_lm_step1_confint <- mod_lm_step1 %>% confint()
mod_lm_step1_confint
##                     2.5 %      97.5 %
## (Intercept) 126.237476952 234.7458861
## xAA2          0.091573496   0.7731248
## xBB2         -2.906045380  -1.9099529
## xBB3         -8.347789337  -7.3889237
## xBB4         -1.811621774  -0.7345152
## x01           0.761476901   0.9614731
## x02          -4.492493292  -3.7811717
## x03           0.006779131   0.1225894
## x04          -0.023822432   0.0980472
## x05          -0.028806049   0.0224628
## x06          -0.626394808   0.3113489

Plot model coefficients for best model:

coefplot(mod_lm_basis, innerCI = 2, pointSize = 2)

Plot model coefficients for second best model:

coefplot(mod_lm_step1, innerCI = 2, pointSize = 2)
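
coefplot also provides multiplot() for overlaying several models on one figure, which can make the comparison easier to read; a sketch (terms shared by both models line up side by side):

# Overlay coefficients from both models in a single plot
multiplot(mod_lm_step1, mod_lm_basis)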

In general, more of the basis model's coefficients are far from 0 than in the all step 1 inputs model. For example, the coefficients of terms containing x01 in the basis model range from strongly negative to strongly positive, while the x01 term in the all-inputs model is slightly positive (0.86) and much closer to 0. This is consistent with the basis model's much higher R-squared. For both models, the 95% confidence intervals on the coefficients are narrow.

Saving

mod_lm_basis %>% readr::write_rds("mod_lm_basis.rds")
mod_lm_step1 %>% readr::write_rds("mod_lm_step1.rds")
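
As a final sanity check (a sketch; mod_lm_basis_reloaded is just an illustrative name), the saved model can be read back and its coefficients compared to the in-memory fit:

# Verify the saved model round-trips intact
mod_lm_basis_reloaded <- readr::read_rds("mod_lm_basis.rds")
all.equal(coef(mod_lm_basis_reloaded), coef(mod_lm_basis))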