Repeat Purchase in Competitive E-Commerce
Name
Institution
1. Section 1: The Problem
a) Discuss the problem you are addressing.
Customer recommendations and product ratings are important in any industry, especially where businesses scramble to satisfy customers; there is therefore a need to measure the extent of customer satisfaction, and this can be done using a machine learning paradigm. Most companies let consumers choose freely from a wide range of products, but they stand to lose when they do not analyze customer satisfaction levels. In this study, satisfaction is measured by the product review score, which indicates how likely a customer is to recommend the product to another person (Souza, 2021). Analysis of review score data is necessary for forecasting: it helps determine future purchases and how the company can adjust to a changing business environment. There are many reasons to analyze the review score; for example, when a market niche is being neglected or a purchase behavior cannot be explained, the company is experiencing a purchase problem that needs to be addressed (Facundo, 2021).
Customer reviews are analyzed to determine how negative ratings can be reduced. The process applies data intelligence to the review score, and many other insights can be drawn from the analysis. Analyzing purchase data helps businesses track other key performance indicators, establish customer behavior, and increase customer touchpoints. Analyzing customer behavior is also important within a given market niche (Patel, 2020). Assessing products through customer review analysis also helps retain current customers and adds to the business's bottom line.
b) Importance of Customer Purchase Analysis
Every business benefits from analyzing its annual purchase behavior, doing so to identify the factors that drive customer purchase decisions. The analysis helps the business address the root problems that affect customer recommendations and ratings. Other advantages include reducing the cost of customer acquisition (Matuszelański & Kopczewska, 2022).
Expanding the customer base through acquisition by recommendation is important for any company; on the other hand, there is a need to retain existing customers, and failing to do so leads to a high customer acquisition cost. The customer service department has a large role in retaining customers: when the current customer base is satisfied with the products, the cost of acquiring new customers falls (Kwak & Clayton-Matthews, 2002). The main aim of the analysis is to identify which variables are highly correlated with the customer review, such as switching to a competitor, failing to renew a subscription, or closing an account. The analysis also examines different purchase indicators, such as unpredictable purchase spikes, price changes that may drive more purchases, or a sales department struggling to sell. Other factors include changes in the market, or a hidden problem in the business that makes customers rate products poorly (Kwak & Clayton-Matthews, 2002).
c) What are the questions and business/management decisions your analysis is trying to address?
The analysis addresses how different variables, such as price, freight value, product name length, description length, quality of product photos, and delivery time, influence the customer review score (de Jong, 2021). Once the variables that contribute most to a customer's product review are identified, the department that deals with the specific behavior can agree on strategic steps to influence future purchase behavior; in the long run, the analysis will reveal specific solutions that benefit the business's bottom line (de Jong, 2021).
d) Describe your problem's decision-maker and what is important for them to know from your data analysis?
The analysis needs to give the strategic marketing department the most significant indicators of the review score, whether gender, tastes and preferences, or changes in the general behavior of the market. Marketers can then act on the best ways to ensure customer retention and recommendation, such as improving communication, shortening the product name, and improving product picture quality (Klimantaviciute, 2021).
e) Discuss the source of your data.
The data was taken from Kaggle.com, a central and renowned data repository. Since the data comes from a published source, it is reliable and can be used in decision making. The data is about three years old, having been recorded in 2019 to observe purchasing behavior during autumn of that year (Wong & Marikannan, 2020).
f) Identification and justification of your choice of the target attribute
The attributes used in the study were selected based on their correlation with the customer product rating, the review score (Kwak & Clayton-Matthews, 2002). The dependent variable was the customer review score. It was studied using the independent variables price, freight value, product name length, description length, quality of product photos, and delivery time (Kwak & Clayton-Matthews, 2002). Each of these factors matters. For example, if online security concerns drive customer ratings, then the cybersecurity department should adopt policies that address them; if paperless billing makes customers lose touch with the business, then the business can think of ways to bring customers closer (Eldik, 2021). Poor internet service may likewise push customers toward a rival business.
2. Section 2: Understand the Data
a) Discuss the nature and size of the dataset(s) you are using.
The dataset has over 90,000 observations; running nrow() in R returns roughly 95,000 rows. The data covers public orders made at Olist Store in Brazil. The dataset holds information from 2016 to 2019 across multiple marketplaces in Brazil and allows orders to be viewed from multiple dimensions, such as order status, payment, price, and geolocation (Al-Basha, 2021). The dataset contains information about customers and their locations, so the orders can be used to find which orders were delivered and where. Every order is assigned a unique customer ID, meaning a customer receives a different ID for each order; this unique ID is important for identifying the individual customer who placed an order with the store (Al-Basha, 2021).
The discussion centers on what drives customers to change their behavior. A multiple regression approach was used to determine the most significant factors. The variables, as stated above, were review score, price, freight value, product name length, description length, quality of product photos, and delivery time (de Jong, 2021).
b) Discuss the data attributes that are relevant to your problem.
The review score is a five-star rating that summarizes customers' overall satisfaction with the products they purchase. The data can be converted to a binary classification problem by treating a 5-star rating as positive and the rest as negative. The data comprises multiple tables, each with its own categorical and numerical columns. The analysis found that five-star ratings account for 76% of the dataset, while the rest carry negative ratings. Looking at the best-selling categories, furniture, décor, and beauty products are the most popular.
c) Tableau visualizations
[Figure: Customer locations by region]
The Tableau map shows the customers by region; the majority of the customers are from Brazil, with most from the city of São Paulo.
[Figure: Top ordered categories by revenue]
The company needs to understand the top ordered categories, since these are the cash cows of the company.
[Figure: Importance of pictures to the customer]
• R-generated plots and aggregation tables
[Figure: Orders by customer state]
Orders are highest in the state of SP (São Paulo), while CE (Ceará) records the fewest product purchases.
d) Product review scores
The diagram above is the correlation matrix; in it, blue indicates positive correlation and red indicates negative correlation. From the figure, it is observed that price and delivery time are poorly correlated with the review score, while freight value is more highly correlated with it. When the data was cleansed and the correlation matrix was built again, the following matrix resulted: there is a high correlation between the freight value and the price, and the weakest correlation is between delivery time and review score. The company is therefore doing well when it comes to delivery times.
[Figure: Correlation matrix after data cleansing]
The matrix shows a high correlation between the total price and the freight value.
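The matrices were produced with the corrplot package; the following condensed call, taken from Appendix 1, shows how:
orders_all_1[is.na(orders_all_1)] <- 0 ##replace missing values before cor()##
cor_mat <- cor(orders_all_1) ##pairwise correlations of the numeric columns##
corrplot(cor_mat, method="color") ##blue = positive, red = negative correlation##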
3. Section 3: Prepare the Data
a) Derivation of attributes
The analysis was based on the correlations between the important variables: review score, freight value, product name length, photo quality, freight ratio, and total price. The R analysis also produced the top product categories. Overall, the company is evaluating cause and effect by finding out why customers rank some products higher than others.
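The derived fields come from Appendix 1 and are reproduced here for reference:
orders_all_2$total_price <- orders_all_2$price + orders_all_2$freight_value ##price plus freight##
orders_all_2$freight_ratio <- orders_all_2$freight_value/orders_all_2$price ##freight share of the price##
orders_all_2$purchase_day_of_week <- wday(orders_all_2$order_approved_at) ##lubridate::wday, 1 = Sunday by default##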
b) Discuss and justify what other steps you may have taken to prepare your data
In the raw data there may be repeated rows; such rows are redundant and had to be removed, together with columns that were not used in the analysis. The following R code was used to drop them:
drops <- c("order_status","order_delivered_carrier_date","order_purchase_timestamp",
"order_approved_at","customer_zip_code_prefix","customer_unique_id","customer_city",
"review_id","review_comment_title","review_comment_message","review_creation_date",
"review_answer_timestamp","order_item_id","seller_zip_code_prefix","seller_city",
"product_category_name","order_delivered_month","product_id","seller_id","order_id",
"customer_id","order_delivered_customer_date","order_estimated_delivery_date",
"customer_state","seller_state","payment_sequential","payment_value","del_time_bucket",
"payment_type","product_weight_g","product_height_cm","shipping_limit_date",
"product_length_cm","product_width_cm","payment_installments")
orders_all_1 <- orders_all[ , !(names(orders_all) %in% drops)]
The duplicated rows were dropped because a customer cannot post the same order twice for the same product at the same time. It is also important that dates be in a standard format and that date components such as day, year, and time be separated so each can be used in the computations (Field, 2009). For categorical columns, the mode of the column was used to fill gaps, and remaining missing records were dropped during the analysis. Since the research also poses a binary classification task, a new column containing the class labels had to be created. The columns used in the analysis also contain numerical data, and descriptive statistics over these columns are the easiest way to understand the data (de Jong, 2021).
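A minimal sketch of these steps follows; the impute_mode() helper is illustrative and not part of the original script, which instead replaced remaining missing values with zero:
orders_all$order_purchase_timestamp <- as.POSIXct(orders_all$order_purchase_timestamp, format="%Y-%m-%d %H:%M:%S", tz="GMT") ##standardise dates##
impute_mode <- function(x) { ##hypothetical helper: fill NAs with the most frequent level##
m <- names(which.max(table(x)))
x[is.na(x)] <- m
x
}
orders_all$payment_type <- impute_mode(orders_all$payment_type)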
c) "Training", "Validation" (if required), and "Testing" subsets of the dataset
The training and test data were prepared and divided for modeling as follows:
#####Low Med High####
orders_all_2$review_score_1 <- ifelse(orders_all_2$review_score < 3, 0, 1)
orders_all_2$review_score <- orders_all_2$review_score_1
####Next model####--Logistic model--####
set.seed(13)
train.index_1 <- createDataPartition(orders_all_2$review_score, p = 0.8, list = FALSE)
train_1.df <- orders_all_2[train.index_1, ]
valid_1.df <- orders_all_2[-train.index_1, ]
In this code, train_1.df is the training dataset, while valid_1.df serves as the testing dataset. The createDataPartition method from the caret package was used to partition the data, with the training set taking 80% and the test set the remaining 20%.
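A quick sanity check of the split sizes (the proportions are approximate because createDataPartition balances on the outcome):
nrow(train_1.df)/nrow(orders_all_2) ##about 0.80##
nrow(valid_1.df)/nrow(orders_all_2) ##about 0.20##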
4. Section 4: Build and Test Prediction Models
a) Configuration of the models
The models were configured (e.g., attributes and other model parameters were selected) to deliver the most relevant insights and the lowest error rates. The attributes included freight value, product photo quantity, total price, and the day of the week of the purchase. The models chosen were the simplest that could give insight into the data. They included a multiple linear regression model, used to compute the coefficients relating the dependent variable, the customer review score, to independent variables such as total price and purchase day.
b) The analysis used three models:
Linear model: the linear model was used to fit the review score against attributes such as freight value, product photo quantity, total price, and the day of purchase of the product. Another predictor used was the product description length. The following is the result of the multiple linear regression:
[Output: summary of the first multiple linear regression]
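The model was fitted with the lm() call from Appendix 1:
lm_1 <- lm(review_score~., data=train.df) ##regress the review score on all remaining columns##
summary(lm_1) ##coefficients, p-values and R-squared discussed below##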
This model estimated the relationship between the review score and freight value, product name length, product description length, product photo quantity, total price, freight ratio, and purchase day of the week. Negative coefficients were found for freight value, product name length, purchase day of week, and freight ratio. The model shows that the most significant predictor is total price, followed by product name length; picture quality also plays an important role.
From the output above, the R-squared value approaches 1, which shows that the model fits the data and that the information given by the linear model can be relied upon. The p-value of the intercept is less than alpha, so we reject the null hypothesis and conclude that the parameters are significant to the customer rating. Looking at the coefficients, the total price, the purchase day of the week, and the product name length are all predictors of the customer rating, and almost all the variables have similar significance for the customer score. The total price, however, is the most significant predictor of the customer review score.
The accuracy of the linear model was tested, with the following results:
[Output: accuracy metrics for the first linear model]
The p-value is effectively zero, which means the null hypothesis is rejected; there is therefore a significant correlation between the variables. The AIC and BIC values are large, however, which shows that the model does not fit all the values and cannot be fully relied upon. The data was then reconfigured, and calculated variables were used in the next model: time to deliver was calculated as the difference between the order time and the delivery time, the new variable total price was calculated as the price plus the freight value, and another calculated field, the freight ratio, was computed as the freight value divided by the order price. Other columns that were not used in the analysis, such as delivery month, delivery time, and estimated delivery time, were deleted from the dataset. The calculated fields were then used in the second linear model, which related the review score to the calculated fields such as time to deliver, freight ratio, and freight price. The result was as follows:
[Output: summary of the second linear regression on derived fields]
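The second model was fitted on the derived fields with the following calls from Appendix 1:
lm_2 <- lm(review_score~., data=train_1.df, na.action=na.omit)
summary(lm_2)
a_1 <- predict(lm_2, valid_1.df) ##predictions on the validation set##
accuracy(a_1, valid_1.df$review_score) ##error metrics via forecast::accuracy()##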
From the result, the intercept of 4.27 is statistically significant, and the total price, which has a positive correlation with the customer review, has the largest coefficient and is hence the most significant predictor. According to the model, it is followed by the product description length, which also has a positive correlation with the customer review. The variables with a positive correlation with the review include total price, product description length, and product name length.
The accuracy of the model is shown below:
[Output: accuracy metrics for the second linear model]
Judging by the reported metrics, this model is more accurate, since its error values are smaller than those of the first linear model. The adjusted R-squared for the model is 0.000223, sigma is 1.28, and the AIC is 25,606. The model scores better in accuracy than the previous linear model.
Logistic regression: logistic regression was also used to fit the data, and the output reported "Number of Fisher Scoring iterations: 1". The regression analysis shows that product photo quantity is the best predictor, followed by price. The AIC, however, is high, showing that the logistic regression does not fit the data as well as the linear regression above. The result of the logistic regression is shown below:
[Output: logistic regression summary]
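The model corresponds to the glm() call in Appendix 1, where family = binomial specifies the logit link:
logit.reg <- glm(review_score~., data=train_1.df, family=binomial)
summary(logit.reg)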
The model also confirms that the total price is the strongest predictor.
c) Ensemble Model
The ensemble model used is a 10-fold logistic regression with 3 iterations; it stacks the linear model and the generalized linear model, with 'svmRadial' as a further base learner:
algorithms_to_use <- c('glm', 'lm', 'svmRadial') ##'svmRadial' requires the kernlab package##
stacked_models <- caretList(review_score ~., data=orders_all_1,
trControl=control_stacking, methodList=algorithms_to_use) ##caretList() is from caretEnsemble##
stacking_results <- resamples(stacked_models)
The scoring rule used to measure the accuracy of the model is the AIC score; from the results, the ensemble had the lowest AIC. The ensemble model returned better scores and showed that price was the most significant predictor.
d) Improving the models
To improve the ensemble algorithm, weighted k-nearest neighbors was applied to the ensemble fit, and performance was best at k = 3. Despite the improvement, the ensemble was not a good classifier for the data. Initially the model was set to k = 2; refitting the kNN with k = 3 improved the model, as shown in the code section. The score for the ensemble was 0.856, which is low compared to the other classifiers, and the count of false positives is higher than the count of false negatives, which is not good for classification (Menard, 2000). The code below was used to improve the model:
stackControl <- trainControl(method="repeatedcv", number=3, repeats=3,
savePredictions=TRUE, classProbs=TRUE)
As can be seen in the code above, the number of cross-validation folds and repeats was increased to 3 from the previous 2 to improve the model accuracy; note that this tunes the resampling, not the k of the nearest-neighbor model, which is set separately (see the sketch below).
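A hedged sketch of the weighted k-NN fit with k = 3 follows; the "kknn" method and its tuning grid (kmax, distance, kernel) come from caret's wrapper around the kknn package and are assumptions here, since the original script does not include this step:
train_1.df$review_score <- as.factor(train_1.df$review_score) ##classification outcome##
knn_fit <- train(review_score~., data=train_1.df, method="kknn",
tuneGrid=data.frame(kmax=3, distance=2, kernel="optimal"),
trControl=trainControl(method="cv", number=3))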
Additionally, logistic regression has an inverse regularization constant whose candidate values are powers of 10; the model performs well when this C value is 10e-2. After validation, the F1 score for the classifier is 0.89, which is better than the ensemble, although the count of true negatives is lower than the count of false positives. The last of the algorithms is the random forest classifier. Hyperparameter tuning was applied to the count of base estimators, which in this case are decision trees: six values were chosen between 50 and 210, limits inclusive, as the number of estimators. The validation score is 0.89, which is good; a sketch of this search follows below.
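A sketch of the estimator-count search described above, assuming the randomForest setup from Appendix 1 (the grid of six values from 50 to 210 is taken from the text):
for (n_trees in seq(50, 210, length.out=6)) { ##50, 82, 114, 146, 178, 210##
rf <- randomForest(review_score~., data=train_1.df, ntree=n_trees)
pred <- predict(rf, valid_1.df)
cat(n_trees, "trees -> accuracy:", mean(pred == valid_1.df$review_score), "\n")
}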
e) Improving the algorithms using XGBoost
The hyperparameters used in this case again concern the decision trees: the values are chosen between 50 and 210, and the number of estimators is limited as mentioned for the random forest. The score for XGBoost is 0.89. Another boosting option is the multi-layer perceptron: using multi-layer perceptrons for classification, with the Adam optimizer and a learning rate of 1e-3, the score is 0.89, which is better than the scores of the ensemble and the other classifiers above.
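A hedged sketch of the XGBoost fit follows; the xgboost package is not used in Appendix 1, so the call below is an assumption (predictors must be a numeric matrix and the label a 0/1 vector):
library(xgboost)
X_train <- as.matrix(train_1.df[, setdiff(names(train_1.df), "review_score")])
y_train <- as.numeric(as.character(train_1.df$review_score))
bst <- xgboost(data=X_train, label=y_train, nrounds=210, objective="binary:logistic", verbose=0) ##nrounds plays the role of the estimator count##
pred <- predict(bst, as.matrix(valid_1.df[, setdiff(names(valid_1.df), "review_score")]))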
Ensemble and logistic regression are the best models for predicting customer satisfaction ratings.
5. Section 5: Problem Conclusions and Recommendations
Many factors contribute to the customer product review score, and the business needs to consider each. Businesses need to incorporate the latest technological advances into their services to improve the chances of repeat customers, since good ratings raise the probability of recommendations. From the study, there are many indicators of the repeat customer; the most important are the picture quality and the length of the product title, which rank high in the regression models (Parsons, Costa, Achten & Stallard, 2009). It is therefore important to leverage current photo-processing technology to ensure the highest image quality, which in turn supports good customer reviews. Brands need to make strategic decisions about product photography, target low-traffic cities for growth opportunities, and create special landing pages for the key categories that aggregate the most sales. Other indicators, such as product quantity, total price, and purchase day of the week, also matter for customer ratings, which means the company should take advantage of days like Black Friday to gain more customers.
References
Parsons, N. R., Costa, M.
L., Achten, J., & Stallard, N. (2009). Repeated measures proportional odds
logistic regression analysis of ordinal score data in the statistical software
package R. Computational Statistics & Data Analysis, 53(3),
632-641.
Souza, F. (2021). Sentiment
Analysis on Brazilian Portuguese User Reviews. arXiv preprint
arXiv:2112.05459.
Verma, A., Kuo, Y. H.,
Kumar, M. M., Pratap, S., & Chen, V. (2022). A data analytic-based
logistics modelling framework for E-commerce enterprise. Enterprise
Information Systems, 1-23.
Facundo, J. G. (2021).
Customer Churn-Prevention Model–Unsupervised Classification.
Patel, M. (2020). Exploratory
Data Analysis and Sentiment Analysis on Brazilian E-Commerce Website (Doctoral
dissertation).
Matuszelański, K., &
Kopczewska, K. (2022). Customer Churn in Retail E-Commerce Business: Spatial
and Machine Learning Approach. Journal of Theoretical and Applied
Electronic Commerce Research, 17(1), 165-198.
de Jong, T. (2021).
Brazilian Marketplace: How Does the Customer Experience Influence the Review
Valence?.
Klimantaviciute, G. (2021).
Customer Churn Prediction in E-Commerce Industry. Journal of Machine
Learning Research, 1, 1-14.
Wong, A. N., &
Marikannan, B. P. (2020, December). Optimising e-commerce customer satisfaction
with machine learning. In Journal of Physics: Conference Series (Vol.
1712, No. 1, p. 012044). IOP Publishing.
Eldik, M. V. (2021). The
Relationship of Geodemographics and Written Online Customer Review Creation:
Insights from Brazil (Doctoral dissertation).
Al-Basha, F. (2021). Forecasting
Retail Sales Using Google Trends and Machine Learning (Doctoral dissertation,
HEC Montréal).
Menard, S. (2000).
Coefficients of determination for multiple logistic regression analysis. The
American Statistician, 54(1), 17-24.
Field, A. (2009). Logistic
regression. Discovering statistics using SPSS, 264,
315.
Kwak, C., & Clayton-Matthews,
A. (2002). Multinomial logistic regression. Nursing Research, 51(6),
404-410.
Appendix 1: R-CODE
library(dplyr)
library(forecast)
library(leaps)
library(tidyverse)
library(caret)
library(corrplot)
library(nnet)
library(MASS)
library(randomForest)
library(lubridate)
library(pROC)
library(gains)
library(readr)
library(caretEnsemble) ##provides caretList() and caretStack() used below##
####Loading datasets####
orders <-
read.csv("C:/data1/olist/olist_orders_dataset.csv",header =
T,stringsAsFactors = F)
customers <-
read.csv("C:/data1/olist/olist_customers_dataset.csv",header = T)
order_reviews <-
read.csv("C:/data1/olist/olist_order_reviews_dataset.csv",header = T)
order_payments <-
read.csv("C:/data1/olist/olist_order_payments_dataset.csv",header =
T)
order_items_details <-
read.csv("C:/data1/olist/olist_order_items_dataset.csv",header = T)
sellers <-
read.csv("C:/data1/olist/olist_sellers_dataset.csv",header = T)
geolocation <-
read.csv("C:/data1/olist/olist_geolocation_dataset.csv",header = T)
products <-
read.csv("C:/data1/olist/olist_products_dataset.csv",header = T)
nrow(orders)
####Data preparation####
###Merging datasets###
orders <- orders[orders$order_status=="delivered",]
orders_all <- merge(orders,customers,by="customer_id",all.x=T)
orders_all$order_purchase_timestamp <- as.POSIXct(orders_all$order_purchase_timestamp,"%Y-%m-%d %H:%M:%S",tz="GMT")
orders_all$order_delivered_customer_date <- as.POSIXct(orders_all$order_delivered_customer_date,"%Y-%m-%d %H:%M:%S",tz="GMT")
orders_all$order_approved_at <- as.POSIXct(orders_all$order_approved_at,"%Y-%m-%d %H:%M:%S",tz="GMT")
orders_all$order_estimated_delivery_date <- as.POSIXct(orders_all$order_estimated_delivery_date,"%Y-%m-%d %H:%M:%S",tz="GMT")
orders_all <-
merge(orders_all,order_reviews,by="order_id")
orders_all <- orders_all[!duplicated(orders_all$order_id),]
orders_all <-
merge(orders_all,order_payments,by="order_id",all.x = T)
orders_all <-
orders_all[!duplicated(orders_all$order_id),]
orders_all <- merge(orders_all,order_items_details,by="order_id",all.x=T)
orders_all <- orders_all[!duplicated(orders_all$order_id),]
orders_all <- merge(orders_all,sellers,by="seller_id",all.x=T)
orders_all <- merge(orders_all,products,by="product_id",all.x=T)
rm(list=setdiff(ls(), "orders_all"))
setwd("C:/data1/olist/")
write.csv(orders_all,file="Orders_merged.csv",row.names
= F)
head(orders_all)
####Loading the combined dataset####
setwd("C:/data1/olist/")
orders_all <- read.csv("Orders_merged.csv",header = T,stringsAsFactors = F)
orders_all$order_purchase_timestamp <- as.POSIXct(orders_all$order_purchase_timestamp,"%Y-%m-%d %H:%M:%S",tz="GMT")
orders_all$order_delivered_customer_date <- as.POSIXct(orders_all$order_delivered_customer_date,"%Y-%m-%d %H:%M:%S",tz="GMT")
orders_all$order_approved_at <- as.POSIXct(orders_all$order_approved_at,"%Y-%m-%d %H:%M:%S",tz="GMT")
orders_all$order_estimated_delivery_date <-
#as.POSIXct(orders_all$order_estimated_delivery_date,"%Y-%m-%d",tz="GMT")
as.Date(orders_all$order_estimated_delivery_date, format = "%Y-%m-%d")
orders_all$order_delivered_month <- format(as.Date(orders_all$order_estimated_delivery_date),"%Y-%m")
orders_all$del_time <-
difftime(orders_all$order_delivered_customer_date,
orders_all$order_purchase_timestamp,
units="days")
orders_all <- na.omit(orders_all) ##drop rows with missing values##
orders_all$del_time_bucket <- ifelse(orders_all$del_time
< 5,"<5",
ifelse(orders_all$del_time<10,"5-10",
ifelse(orders_all$del_time < 20,"10-20",
ifelse(orders_all$del_time<40,"20-40",">40"))))
nrow(orders_all)
####Summarisation and data exploration####
no_of_orders_month <- orders_all %>%
group_by(order_delivered_month) %>%
summarise(no=n(),rev_in_K=sum(price)/1000,del_time=mean(del_time))
no_of_orders_location <- orders_all %>%
group_by(customer_state) %>%
summarise(no=n(),rev_in_K=sum(price)/1000,del_time=mean(del_time))
no_of_orders_location_month <- orders_all %>%
group_by(customer_state,order_delivered_month) %>%
summarise(no=n(),rev_in_K=sum(price)/1000,del_time=mean(del_time))
no_of_orders_rs <-orders_all %>%
group_by(review_score,order_delivered_month) %>%
summarise(no=n())
no_of_orders_rs_p <-orders_all %>%
group_by(review_score,product_category_name) %>%
summarise(no=n())
no_of_orders_pt <- orders_all %>%
group_by(payment_type) %>%
summarise(no=n(),rev_in_K=sum(price)/1000,del_time=mean(del_time))
no_of_orders_seller <- orders_all %>%
group_by(seller_id) %>%
summarise(no=n(),rev_in_K=sum(price)/1000,del_time=mean(del_time))
no_of_orders_seller <-
no_of_orders_seller[order(-no_of_orders_seller$rev_in_K),]
no_of_orders_seller_1 <-
no_of_orders_seller[order(-no_of_orders_seller$no),]
no_of_orders_photos <- orders_all %>% group_by(product_photos_qty) %>%
summarise(no=n(),rev_in_K=sum(price)/1000,del_time=mean(del_time))
no_of_orders_pcm <- orders_all %>%
group_by(product_category_name) %>%
summarise(no=n(),rev_in_K=sum(price)/1000,del_time=mean(del_time))
no_of_orders_nl <- orders_all %>%
group_by(product_name_lenght) %>%
summarise(no=n(),rev_in_K=sum(price)/1000,del_time=mean(del_time))
####Data preparation for prediction####
###Selecting variables and dropping columns###
drops <-
c("order_status","order_delivered_carrier_date","order_purchase_timestamp",
"order_approved_at","customer_zip_code_prefix","customer_unique_id",
"customer_city","review_id","review_comment_title","review_comment_message",
"review_creation_date","review_answer_timestamp","order_item_id","seller_zip_code_prefix",
"seller_city","product_category_name","order_delivered_month","product_id","seller_id",
"order_id","customer_id","order_delivered_customer_date","order_estimated_delivery_date",
"customer_state","seller_state","payment_sequential","payment_value","del_time_bucket","payment_type","product_weight_g",
"product_height_cm",
"shipping_limit_date","product_length_cm","product_width_cm","payment_installments")
orders_all_1 <- orders_all[ , !(names(orders_all) %in%
drops)]
head(orders_all_1)
#####Creating partition sets####
orders_all_1$del_time <- as.numeric(orders_all_1$del_time)
set.seed(13)
train.index <- createDataPartition(orders_all_1$review_score,
p = 0.8, list = FALSE)
train.df <- orders_all_1[train.index, ]
valid.df <- orders_all_1[-train.index, ]
orders_all_1[is.na(orders_all_1)] <- 0
cor_mat <- cor(orders_all_1)
corrplot(cor_mat, method="color")
###Red indicates negative correlation and blue positive correlation;
###the intensity of the colour gives the strength of the correlation##
##We decide that none of the variables are strongly correlated##
####Linear regression####
lm_1 <- lm(review_score~.,data=train.df)
summary(lm_1)
##We did not use the correlation matrix because there were categorical
##variables, and the correlation matrix causes multicollinearity##
####Reg search####
search <- regsubsets(review_score ~ ., data = train.df,
nbest = 1, nvmax = dim(orders_all_1)[2],
method = "exhaustive")
sum <- summary(search) ##Shows that 7 predictors give the best model##
a <- predict(lm_1,valid.df) ###Predicting on the validation dataset##
accuracy(a, valid.df$review_score) ####RMSE too high, not so good a model
#####Since the direct variables were weakly correlated, we try some derived
#####variables and rerun to see if there is an improvement####
orders_all_2 <- orders_all
head(orders_all_2)
orders_all_2$est_del_time<-
difftime(orders_all$order_estimated_delivery_date,
orders_all$order_approved_at,
units="days")
orders_all_2$delta_time<-
orders_all_2$est_del_time-orders_all_2$del_time
orders_all_2$Late <-
ifelse(orders_all_2$delta_time<0,1,0)
orders_all_2$total_price <-
orders_all_2$price+orders_all_2$freight_value
orders_all_2$freight_ratio <-
orders_all_2$freight_value/orders_all_2$price
orders_all_2$purchase_day_of_week <-
wday(orders_all_2$order_approved_at)
orders_all_2 <- orders_all_2[ , !(names(orders_all_2)
%in% drops)]
orders_all_2$del_time <-
as.numeric(orders_all_2$del_time)
orders_all_2$est_del_time <-
as.numeric(orders_all_2$est_del_time)
orders_all_2$delta_time <-
as.numeric(orders_all_2$delta_time)
orders_all_2$order_delivered_month <- NULL
orders_all_2$del_time <- NULL
orders_all_2$delta_time <- NULL
orders_all_2$Late <- NULL
orders_all_2$est_del_time <- NULL
orders_all_2[is.na(orders_all_2)] <- 0
orders_all_2 <- orders_all_2[ ,
-which(names(orders_all_2) %in% c("price"))]
cor_mat_2 <- cor(orders_all_2)
str(orders_all_2)
corrplot(cor_mat_2, method="color")
#####Partition the dataset######## Using the new derived metrics####
set.seed(13)
train.index_1 <-
createDataPartition(orders_all_2$review_score, p = 0.8, list = FALSE)
train_1.df <- orders_all_2[train.index_1, ]
valid_1.df <- orders_all_2[-train.index_1, ]
train_1.df$review_score <- as.numeric(train_1.df$review_score)
lm_2 <- lm(review_score~.,data=train_1.df,na.action =
na.omit)
summary(lm_2)
a_1 <- predict(lm_2,valid_1.df) ###Predicting on the validation dataset##
accuracy(a_1, valid_1.df$review_score) ####RMSE too high, not so good a model###
str(valid_1.df)
str(valid_1.df)
#####Low Med High####
orders_all_2$review_score_1 <-
ifelse(orders_all_2$review_score<3,0,1)
orders_all_2$review_score <-
orders_all_2$review_score_1
####Next model####--Logistic model--####
set.seed(13)
train.index_1 <-
createDataPartition(orders_all_2$review_score, p = 0.8, list = FALSE)
train_1.df <- orders_all_2[train.index_1, ]
valid_1.df <- orders_all_2[-train.index_1, ]
orders_all_2$review_score <- as.factor(orders_all_2$review_score) ###Levelling by giving a reference###
logit.reg <- glm(review_score~., data = train_1.df, family = binomial) ##binomial family gives the logit model##
summary(logit.reg)
###Fitting the model and checking accuracy using a confusion matrix##
logit.reg.pred <- as.data.frame(predict(logit.reg,
valid_1.df[, -1], type = "response"))
colnames(logit.reg.pred)[1] <- "p"
logit.reg.pred$class <-
ifelse(logit.reg.pred$p>0.5,1,0)
cm <- table(logit.reg.pred$class,valid_1.df$review_score)
caret::confusionMatrix(cm)
########Next model -linear discriminant analysis#####
#Running linear discriminant analysis
lda <- lda(review_score~.,data=train_1.df)
x= "review_score"
lda <- lda(as.formula(paste(x, ".", sep =
"~")), train_1.df)
###Checking model on test dataset##
str(valid_1.df)
pred_lda <- predict(lda, valid_1.df)
###Checking accuracy using a confusion matrix## Accuracy is again found to be 80%##
cm_lda <- (table(pred_lda$class,
valid_1.df$review_score))
caret::confusionMatrix(cm_lda)
####Next model-random forest ####
##First convert the dependent variable to factors##
valid_1.df$review_score <-
as.factor(valid_1.df$review_score)
train_1.df$review_score <-
as.factor(train_1.df$review_score)
##Running randomForest on training##
rfm <- randomForest(review_score~., train_1.df)
###Predicting on the validation dataset and checking for accuracy###---90%
pred_rfm <-(predict(rfm,valid_1.df))
pred_rfm_p <- as.data.frame(predict(rfm, valid_1.df, type
= "prob"))
cm_rfm <- (table(pred_rfm, valid_1.df$review_score))
caret::confusionMatrix(cm_rfm)
####Out of all the models, random forest gives us the best result, so we choose it to explain the model####
##Plotting lift####
valid_1.df$review_score <-
as.numeric(valid_1.df$review_score)
p <- pred_lda$posterior[,2]
gain <- gains(valid_1.df$review_score,p)
plot(c(0,gain$cume.pct.of.total*sum(valid_1.df$review_score))~c(0,gain$cume.obs),
xlab = "# cases", ylab = "classified", main = "Lift-chart", type = "l")
lines(c(0,sum(valid_1.df$review_score))~c(0,
dim(valid_1.df)[1]), lty = 5,col="blue1")
###Plotting roc curve for random forest###
r <- roc(valid_1.df$review_score,pred_rfm_p$`1`)
plot.roc(r)
auc(r)
## ensemble models
## stacking the models
set.seed(100)
control_stacking <- trainControl(method="repeatedcv", number=2, repeats=2,
savePredictions=TRUE, classProbs=TRUE)
algorithms_to_use <- c('lm', 'glm') ##model names must be unique in methodList##
stacked_models <- caretList(review_score ~., data=orders_all_1,
trControl=control_stacking, methodList=algorithms_to_use)
stacking_results <- resamples(stacked_models)
summary(stacking_results)
#combine the predictions
# stack using ensemble
stackControl <- trainControl(method="repeatedcv", number=3, repeats=3,
savePredictions=TRUE, classProbs=TRUE)
set.seed(234)
glm_stack <- caretStack(stacked_models, method="lm", metric="RMSE",
trControl=stackControl) ##RMSE metric since review_score is numeric here##
print(glm_stack)