Final Project Spring 2022
1. Introduction.
The data analysis in this report uses the Diabetes Health Indicators Dataset to find the
tuning parameters of a random forest that most strongly drive cross-validation accuracy.
The analysis uses the randomForest algorithm together with the cv.rf cross-validation
function to collect the response. Two experimental designs are proposed, a fractional
factorial design and an optimal design; each design is then evaluated and the better
design recommended for the analysis.
2. Methodology
# Clear memory
rm( list=ls() )
#loading libraries
library(randomForest)
library(FrF2)      # FrF2() for fractional factorial designs
library(AlgDesign) # optFederov() for D-optimal designs
library(skpr)      # gen_design(), eval_design(), plot_correlations()
#Loading the cross-validation function
source("CrossValidation_RF.R")
#Loading data
load("diabetes.RData")
#Question 1. Propose a fractional factorial design for the problem. In addition,
#propose an experimental design constructed using the optimal design approach.
###Using fractional design
design1 <- FrF2(64, 7, generators = "ABCDEF",
                factor.names = list(ntree = c(100, 1000), mtry = c(2, 6),
                                    replace = c(0, 1), nodesize = c(1, 11),
                                    classwt = c(0.5, 0.9), cutoff = c(0.2, 0.8),
                                    maxnodes = c(10, 1000)),
                randomize = FALSE)
design1
## ntree mtry replace nodesize classwt cutoff maxnodes
## 1 100 2 0 1 0.5 0.2 1000
## 2 1000 2 0 1 0.5 0.2 10
## 3 100 6 0 1 0.5 0.2 10
## 4 1000 6 0 1 0.5 0.2 1000
## 5 100 2 1 1 0.5 0.2 10
## 6 1000 2 1 1 0.5 0.2 1000
## 7 100 6 1 1 0.5 0.2 1000
## 8 1000 6 1 1 0.5 0.2 10
## 9 100 2 0 11 0.5 0.2 10
## 10 1000 2 0 11 0.5 0.2 1000
## 11 100 6 0 11 0.5 0.2 1000
## 12 1000 6 0 11 0.5 0.2 10
## 13 100 2 1 11 0.5 0.2 1000
## 14 1000 2 1 11 0.5 0.2 10
## 15 100 6 1 11 0.5 0.2 10
## 16 1000 6 1 11 0.5 0.2 1000
## 17 100 2 0 1 0.9 0.2 10
## 18 1000 2 0 1 0.9 0.2 1000
## 19 100 6 0 1 0.9 0.2 1000
## 20 1000 6 0 1 0.9 0.2 10
## 21 100 2 1 1 0.9 0.2 1000
## 22 1000 2 1 1 0.9 0.2 10
## 23 100 6 1 1 0.9 0.2 10
## 24 1000 6 1 1 0.9 0.2 1000
## 25 100 2 0 11 0.9 0.2 1000
## 26 1000 2 0 11 0.9 0.2 10
## 27 100 6 0 11 0.9 0.2 10
## 28 1000 6 0 11 0.9 0.2 1000
## 29 100 2 1 11 0.9 0.2 10
## 30 1000 2 1 11 0.9 0.2 1000
## 31 100 6 1 11 0.9 0.2 1000
## 32 1000 6 1 11 0.9 0.2 10
## 33 100 2 0 1 0.5 0.8 10
## 34 1000 2 0 1 0.5 0.8 1000
## 35 100 6 0 1 0.5 0.8 1000
## 36 1000 6 0 1 0.5 0.8 10
## 37 100 2 1 1 0.5 0.8 1000
## 38 1000 2 1 1 0.5 0.8 10
## 39 100 6 1 1 0.5 0.8 10
## 40 1000 6 1 1 0.5 0.8 1000
## 41 100 2 0 11 0.5 0.8 1000
## 42 1000 2 0 11 0.5 0.8 10
## 43 100 6 0 11 0.5 0.8 10
## 44 1000 6 0 11 0.5 0.8 1000
## 45 100 2 1 11 0.5 0.8 10
## 46 1000 2 1 11 0.5 0.8 1000
## 47 100 6 1 11 0.5 0.8 1000
## 48 1000 6 1 11 0.5 0.8 10
## 49 100 2 0 1 0.9 0.8 1000
## 50 1000 2 0 1 0.9 0.8 10
## 51 100 6 0 1 0.9 0.8 10
## 52 1000 6 0 1 0.9 0.8 1000
## 53 100 2 1 1 0.9 0.8 10
## 54 1000 2 1 1 0.9 0.8 1000
## 55 100 6 1 1 0.9 0.8 1000
## 56 1000 6 1 1 0.9 0.8 10
## 57 100 2 0 11 0.9 0.8 10
## 58 1000 2 0 11 0.9 0.8 1000
## 59 100 6 0 11 0.9 0.8 1000
## 60 1000 6 0 11 0.9 0.8 10
## 61 100 2 1 11 0.9 0.8 1000
## 62 1000 2 1 11 0.9 0.8 10
## 63 100 6 1 11 0.9 0.8 10
## 64 1000 6 1 11 0.9 0.8 1000
## class=design, type= FrF2.generators
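As an illustrative cross-check (a Python sketch, not part of the R script above), the same 2^(7-1) structure can be rebuilt by hand: six free factors form a full 2^6 factorial in coded units, and the seventh factor is defined by the generator G = ABCDEF. The sketch confirms the design has 64 runs and that all main-effect columns are orthogonal.

```python
from itertools import product

# Rebuild the 2^(7-1) design in coded (-1/+1) units: six free factors A..F
# form a full 2^6 factorial, and the seventh factor G = A*B*C*D*E*F.
runs = []
for levels in product((-1, 1), repeat=6):
    g = 1
    for x in levels:
        g *= x                      # generator G = ABCDEF
    runs.append(list(levels) + [g])

n_runs = len(runs)                  # 64 = half of the 2^7 = 128 full factorial

# All seven main-effect columns are pairwise orthogonal (zero dot product),
# which is what makes the main effects independently estimable.
max_dot = max(abs(sum(r[i] * r[j] for r in runs))
              for i in range(7) for j in range(i + 1, 7))
print(n_runs, max_dot)              # 64 0
```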
The factorial design proposed for this problem is a 2^(7-1) fractional factorial design
with 64 runs, constructed from the generator G = ABCDEF. Additionally, I propose a
D-optimal design with 22 runs as an alternative experimental design, constructed as below.
####Using Optimal design####
cand.list = expand.grid(ntree = c(100, 1000), mtry = c(2, 6), replace = c(0, 1),
                        nodesize = c(1, 11), classwt = c(0.5, 0.9),
                        cutoff = c(0.2, 0.8), maxnodes = c(10, 1000))
set.seed(689)
optim<-optFederov( ~ ., data = cand.list, nTrials = 22)
print(optim$design)
## ntree mtry replace nodesize classwt cutoff maxnodes
## 1 100 2 0 1 0.5 0.2 10
## 10 1000 2 0 11 0.5 0.2 10
## 14 1000 2 1 11 0.5 0.2 10
## 16 1000 6 1 11 0.5 0.2 10
## 24 1000 6 1 1 0.9 0.2 10
## 31 100 6 1 11 0.9 0.2 10
## 35 100 6 0 1 0.5 0.8 10
## 44 1000 6 0 11 0.5 0.8 10
## 50 1000 2 0 1 0.9 0.8 10
## 53 100 2 1 1 0.9 0.8 10
## 57 100 2 0 11 0.9 0.8 10
## 66 1000 2 0 1 0.5 0.2 1000
## 67 100 6 0 1 0.5 0.2 1000
## 82 1000 2 0 1 0.9 0.2 1000
## 91 100 6 0 11 0.9 0.2 1000
## 93 100 2 1 11 0.9 0.2 1000
## 101 100 2 1 1 0.5 0.8 1000
## 103 100 6 1 1 0.5 0.8 1000
## 105 100 2 0 11 0.5 0.8 1000
## 110 1000 2 1 11 0.5 0.8 1000
## 120 1000 6 1 1 0.9 0.8 1000
## 124 1000 6 0 11 0.9 0.8 1000
design2<-optim$design
#Question 2. Compare the optimal design with the fractional factorial design in
#practical and statistical terms. For instance, what is the performance of the
#designs for studying the main effects of the tuning parameters only? Can they
#estimate all two-parameter interactions? Why or why not? How do they compare in
#terms of multicollinearity?
#Evaluating Optimal design
model_design <- gen_design(cand.list,
                           model = ~ ntree + mtry + replace + nodesize +
                                     classwt + cutoff + maxnodes,
                           trials = 22)
eval_design(model_design)
## parameter type power
## 1 (Intercept) effect.power 0.9904596
## 2 ntree effect.power 0.9904596
## 3 mtry effect.power 0.9904596
## 4 replace effect.power 0.9904596
## 5 nodesize effect.power 0.9904596
## 6 classwt effect.power 0.9904596
## 7 cutoff effect.power 0.9904596
## 8 maxnodes effect.power 0.9904596
## 9 (Intercept) parameter.power 0.9904596
## 10 ntree parameter.power 0.9904596
## 11 mtry parameter.power 0.9904596
## 12 replace parameter.power 0.9904596
## 13 nodesize parameter.power 0.9904596
## 14 classwt parameter.power 0.9904596
## 15 cutoff parameter.power 0.9904596
## 16 maxnodes parameter.power 0.9904596
## ============Evaluation Info=============
## • Alpha = 0.05 • Trials = 22 • Blocked = FALSE
## • Evaluating Model = ~ntree + mtry + replace + nodesize + classwt +
##     cutoff + maxnodes
## • Anticipated Coefficients = c(1, 1, 1, 1, 1, 1, 1, 1)
#Evaluating fractional design
model_design1 <- gen_design(design1,
                            model = ~ ntree + mtry + replace + nodesize +
                                      classwt + cutoff + maxnodes,
                            trials = 22)
eval_design(model_design1)
## parameter type power
## 1 (Intercept) effect.power 0.9899252
## 2 ntree effect.power 0.9899252
## 3 mtry effect.power 0.9899252
## 4 replace effect.power 0.9899252
## 5 nodesize effect.power 0.9899252
## 6 classwt effect.power 0.9899252
## 7 cutoff effect.power 0.9899252
## 8 maxnodes effect.power 0.9899252
## 9 (Intercept) parameter.power 0.9899252
## 10 ntree1 parameter.power 0.9899252
## 11 mtry1 parameter.power 0.9899252
## 12 replace1 parameter.power 0.9899252
## 13 nodesize1 parameter.power 0.9899252
## 14 classwt1 parameter.power 0.9899252
## 15 cutoff1 parameter.power 0.9899252
## 16 maxnodes1 parameter.power 0.9899252
## ============Evaluation Info=============
## • Alpha = 0.05 • Trials = 22 • Blocked = FALSE
## • Evaluating Model = ~ntree + mtry + replace + nodesize + classwt +
##     cutoff + maxnodes
## • Anticipated Coefficients = c(1, 1, 1, 1, 1, 1, 1, 1)
#checking multicollinearity of Optimal design
plot_correlations(model_design)
#checking multicollinearity of fractional design
plot_correlations(model_design1)
I compared the optimal design with the fractional factorial design. Statistically, I
evaluated the power of both designs: the power of a design is the probability that it
detects an effect as statistically significant when the effect is real. From the
evaluation results, the optimal design has a slightly higher power (0.9904596) than the
fractional factorial design (0.9899252), so it is marginally more likely to detect a
truly active main effect. For studying main effects only, both designs therefore
perform very well.
The two designs differ, however, in their ability to estimate two-parameter
interactions. A model with an intercept, seven main effects, and all 21 two-factor
interactions has 29 parameters. The 64-run 2^(7-1) fractional factorial is a
resolution VII design, so it can estimate all two-factor interactions free of aliasing
with main effects. The 22-run optimal design cannot, because it was constructed for a
main-effects model and has fewer runs (22) than the 29 parameters such a model requires.
Comparing the two designs in terms of multicollinearity, the optimal design exhibits
low correlation between the tuning-parameter columns, as evident in its correlation
plot. In contrast, the fractional factorial design's correlation plot shows higher
multicollinearity between the tuning parameters.
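Whether all two-parameter interactions are estimable comes down to a parameter count, which can be checked with a quick calculation (an illustrative Python sketch, separate from the R analysis):

```python
from math import comb

k = 7                        # number of tuning parameters (factors)
n_optimal = 22               # runs in the D-optimal design
n_fractional = 64            # runs in the 2^(7-1) fractional factorial

# A model with intercept, all main effects, and all two-factor interactions
p = 1 + k + comb(k, 2)       # 1 + 7 + 21 = 29 parameters

print(p, p <= n_optimal, p <= n_fractional)   # 29 False True
```

With only 22 runs, the optimal design cannot fit the 29-parameter interaction model, while the 64-run fractional factorial can.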
#Question 3. Recommend one experimental design between the two options in
#Question 1. Motivate your decision.
I recommend the optimal design for two reasons. First, it has the higher design power,
meaning it is more likely than the fractional factorial design to detect a statistically
significant effect. Second, it exhibits low multicollinearity between the tuning
parameters while requiring only 22 runs instead of 64, making it the better design for
this analysis.
#Question 4. Using a commercial software, the TAs and I came up with the
#experimental design shown in Table 2. How does your recommended design in the
#previous question compare with this one?
I used R to generate my design. Both Table 2 and my recommended design contain 22 runs
(the row labels in my design run up to 124 because they index the 128-point candidate
set, not because there are 124 runs). My recommended design covers the same seven tuning
parameters of the random forest as Table 2. However, my design uses only two levels for
each parameter, while Table 2 uses three levels for every tuning parameter except
"replace", which takes only the values 0 and 1.
###Part II: Data Analysis
#Question 5. Collect data using your recommended design in Question 3.
#Using the function
results <- cv.rf(design2, y, X)
## Collecting response on test combination 1
## Collecting response on test combination 2
## Collecting response on test combination 3
## Collecting response on test combination 4
## Collecting response on test combination 5
## Collecting response on test combination 6
## Collecting response on test combination 7
## Collecting response on test combination 8
## Collecting response on test combination 9
## Collecting response on test combination 10
## Collecting response on test combination 11
## Collecting response on test combination 12
## Collecting response on test combination 13
## Collecting response on test combination 14
## Collecting response on test combination 15
## Collecting response on test combination 16
## Collecting response on test combination 17
## Collecting response on test combination 18
## Collecting response on test combination 19
## Collecting response on test combination 20
## Collecting response on test combination 21
## Collecting response on test combination 22
#printing results
print(results)
## ntree mtry replace nodesize classwt cutoff maxnodes CV
## 1 100 2 0 1 0.5 0.2 10 0.6804450
## 10 1000 2 0 11 0.5 0.2 10 0.6829331
## 14 1000 2 1 11 0.5 0.2 10 0.6828889
## 16 1000 6 1 11 0.5 0.2 10 0.7073731
## 24 1000 6 1 1 0.9 0.2 10 0.4999999
## 31 100 6 1 11 0.9 0.2 10 0.5000000
## 35 100 6 0 1 0.5 0.8 10 0.6963337
## 44 1000 6 0 11 0.5 0.8 10 0.6957381
## 50 1000 2 0 1 0.9 0.8 10 0.5001068
## 53 100 2 1 1 0.9 0.8 10 0.5003334
## 57 100 2 0 11 0.9 0.8 10 0.5008708
## 66 1000 2 0 1 0.5 0.2 1000 0.6634838
## 67 100 6 0 1 0.5 0.2 1000 0.6253510
## 82 1000 2 0 1 0.9 0.2 1000 0.5007201
## 91 100 6 0 11 0.9 0.2 1000 0.5230136
## 93 100 2 1 11 0.9 0.2 1000 0.5000890
## 101 100 2 1 1 0.5 0.8 1000 0.6701066
## 103 100 6 1 1 0.5 0.8 1000 0.6488353
## 105 100 2 0 11 0.5 0.8 1000 0.6827733
## 110 1000 2 1 11 0.5 0.8 1000 0.6807332
## 120 1000 6 1 1 0.9 0.8 1000 0.6507469
## 124 1000 6 0 11 0.9 0.8 1000 0.7124755
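As a quick sanity check before the formal analysis (an illustrative Python sketch using the CV values printed above), the best-performing run can be picked out directly:

```python
# CV accuracies of the 22 design points, copied from the cv.rf output above
cv = [0.6804450, 0.6829331, 0.6828889, 0.7073731, 0.4999999, 0.5000000,
      0.6963337, 0.6957381, 0.5001068, 0.5003334, 0.5008708, 0.6634838,
      0.6253510, 0.5007201, 0.5230136, 0.5000890, 0.6701066, 0.6488353,
      0.6827733, 0.6807332, 0.6507469, 0.7124755]

best = max(cv)           # highest cross-validation accuracy
print(best)              # 0.7124755, the last run (candidate row 124):
                         # ntree=1000, mtry=6, replace=0, nodesize=11,
                         # classwt=0.9, cutoff=0.8, maxnodes=1000
```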
#Question 6. Conduct a detailed data analysis. What are the influential tuning
#parameters? What is the final model that links the tuning parameters to the
#cross-validation accuracy? Does the final model provide a good fit to the data?
model <- randomForest(X, y, ntree = c(1000), mtry = c(6), replace = c(0),
                      nodesize = c(11), classwt = c(0.9, 0.5),
                      cutoff = c(0.8, 0.2), maxnodes = c(1000))
print(model)
##
## Call:
## randomForest(x = X, y = y, ntree = c(1000), mtry = c(6), replace = c(0),
classwt = c(0.9, 0.5), cutoff = c(0.8, 0.2), nodesize = c(11), maxnodes
= c(1000))
## Type of random forest: classification
## Number of trees: 1000
## No. of variables tried at each split: 6
##
## OOB estimate of error rate: 26.45%
## Confusion matrix:
## 0 1 class.error
## 0 8138 4362 0.34896
## 1 2251 10249 0.18008
model_predict = predict(model)
mean(model_predict==y)
## [1] 0.73548
The influential tuning-parameter setting is ntree = 1000, mtry = 6, replace = 0,
nodesize = 11, classwt = 0.9, cutoff = 0.8, and maxnodes = 1000, which gives the highest
cross-validation accuracy of 0.7124755. The final model that links the tuning parameters
to the cross-validation accuracy is model <- randomForest(X, y, ntree = 1000, mtry = 6,
replace = 0, nodesize = 11, classwt = c(0.9, 0.5), cutoff = c(0.8, 0.2),
maxnodes = 1000). The final model provides a good fit to the data, with 73.548% accuracy.
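The reported accuracy can be cross-checked against the printed OOB confusion matrix (an illustrative Python sketch using the counts shown above):

```python
# OOB confusion matrix of the fitted random forest (classes 0 and 1)
tn, fp = 8138, 4362    # class 0: correctly classified, misclassified
fn, tp = 2251, 10249   # class 1: misclassified, correctly classified

n = tn + fp + fn + tp          # total observations
accuracy = (tn + tp) / n       # overall classification accuracy

print(n, accuracy)             # 25000 0.73548, matching mean(model_predict == y)
```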
3. Conclusions
The design I recommended worked well because the design power was high enough to
ensure that the effects were significant to the model. Additionally, the model worked so
well because the accuracy of 73.428% indicated a good fit of the model. For future
experimentation, it is important to change the cutoff and classwt tuning parameters to
understand how it will affect the accuracy of the model.
Thursday, June 16, 2022