Repeat Purchase in Competitive E-Commerce
Name
Institution
1. Section 1: The Problem
a) Discuss the problem you are addressing.
Customer recommendations and product ratings are important in any industry, especially where businesses scramble to satisfy customers; there is therefore a need to measure the extent of customer satisfaction, and this can be done using a machine learning paradigm. Most companies let consumers choose freely from a wide range of products, but they stand to lose when they do not analyze customer satisfaction levels. In this study, satisfaction is measured by the product review score, which indicates how likely a customer is to recommend the product to another person (Souza, 2021). Analysis of review score data is necessary for forecasting: it helps determine future purchases and how the company can adjust to a changing business environment. There are many reasons to analyze the review score; for example, when a market niche is being neglected or a purchase behavior cannot be explained, the company is experiencing a purchase problem that needs to be addressed (Facundo, 2021).
Customer reviews are analyzed to determine how negative ratings can be reduced. The process applies data intelligence to the review score, and many other insights can be drawn from the analysis. Analyzing purchase data helps businesses track other key performance indicators, establish customer behavior, and increase customer touchpoints. Analyzing customer behavior is also important within a given market niche (Patel, 2020). Assessing products through customer review analysis also helps retain current customers and adds to the business's bottom line.
b) Importance of Customer Purchase Analysis
Every business benefits from analyzing its annual purchase behavior, doing so to identify the factors that drive customer purchase decisions. The analysis helps the business address the root problems that affect customer recommendations and ratings. Other advantages include reducing the cost of customer acquisition (Matuszelański & Kopczewska, 2022).
Expanding the customer base through acquisition by recommendation is important for any company; on the other hand, there is a need to retain existing customers, and failing to do so leads to a high customer acquisition cost. The customer service department has a large role in retaining customers: when the current customer base is satisfied with the products, the cost of acquiring new customers falls (Kwak & Clayton-Matthews, 2002). The main aim of the analysis is to identify which variables are highly correlated with the customer review, such as switching to a competitor, failing to renew a subscription, or closing an account. The analysis also examines different purchase indicators, such as unpredictable purchase spikes, price changes that may drive more purchases, or a sales department struggling to sell. Other factors include changes in the market, or a hidden problem in the business that makes customers rate products poorly (Kwak & Clayton-Matthews, 2002).
c) What are the questions and business/management decisions your analysis is trying to address?
The analysis addresses how different variables, such as price, freight value, product name length, description length, quality of product photos, and delivery time, influence the customer review score (de Jong, 2021). Once the variables that contribute most to a customer's product review are identified, the department that deals with the specific behavior can agree on strategic steps to influence future purchase behavior; in the long run, the analysis will reveal specific solutions that benefit the business's bottom line (de Jong, 2021).
d) Describe your problem's decision-maker and what is important for them to know from your data analysis?
The analysis needs to give the strategic marketing department the most significant indicators of the review score, whether gender, tastes and preferences, or changes in the general behavior of the market. Marketers can then act on the best ways to ensure customer retention and recommendation, such as improving communication, shortening the product name, and improving product picture quality (Klimantaviciute, 2021).
e) Discuss the source of your data.
The data was taken from Kaggle.com, a central and renowned data repository. Since the data comes from a published source, it is reliable and can be used in decision making. The data is about three years old, having been recorded in 2019 to observe purchasing behavior during autumn of that year (Wong & Marikannan, 2020).
f) Identification and justification of your choice of the target attribute
The attributes used in the study were selected based on their correlation with the customer product rating, the review score (Kwak & Clayton-Matthews, 2002). The dependent variable was the customer review score. It was studied using the independent variables price, freight value, product name length, description length, quality of product photos, and delivery time (Kwak & Clayton-Matthews, 2002). Each of these factors matters. For example, if online security concerns drive customer ratings, then the cybersecurity department should adopt policies that address them; if paperless billing makes customers lose touch with the business, then the business can think of ways to bring customers closer (Eldik, 2021). Poor internet service may likewise push customers toward a rival business.
2. Section 2: Understand the Data
a) Discuss the nature and size of the dataset(s) you are using.
The dataset has over 90,000 observations; running nrow() in R returns roughly 95,000 rows. The data covers public orders made at Olist Store in Brazil. The dataset holds information from 2016 to 2019 across multiple marketplaces in Brazil and allows orders to be viewed from multiple dimensions, such as order status, payment, price, and geolocation (Al-Basha, 2021). The dataset contains information about customers and their locations, so the orders can be used to find which orders were delivered and where. Every order is assigned a unique customer ID, meaning a customer receives a different ID for each order; this unique ID is important for identifying the individual customer who placed an order with the store (Al-Basha, 2021).
The discussion centers on what drives customers to change their behavior. A multiple regression approach was used to determine the most significant factors. The variables, as stated above, were review score, price, freight value, product name length, description length, quality of product photos, and delivery time (de Jong, 2021).
b) Discuss the data attributes that are relevant to your problem.
The review score is a five-star rating that summarizes customers' overall satisfaction with the products they purchase. The data can be converted to a binary classification problem by treating a 5-star rating as positive and the rest as negative. The data comprises multiple tables, each with its own categorical and numerical columns. The analysis found that five-star ratings account for 76% of the dataset, while the rest carry negative ratings. Looking at the best-selling categories, furniture, décor, and beauty products are the most popular.
c) Tableau visualizations
[Figure: Customer locations by region]
The Tableau map shows the customers by region; the majority of the customers are from Brazil, with most from the city of São Paulo.
[Figure: Top ordered categories by revenue]
The company needs to understand the top ordered categories, since these are the cash cows of the company.
[Figure: Importance of pictures to the customer]
• R-generated plots and aggregation tables
[Figure: Orders by customer state]
Orders are highest in the state of SP (São Paulo), while CE (Ceará) records the fewest product purchases.
d) Product review scores
The diagram above is the correlation matrix; in it, blue indicates positive correlation and red indicates negative correlation. From the figure, it is observed that price and delivery time are poorly correlated with the review score, while freight value is more highly correlated with it. When the data was cleansed and the correlation matrix was built again, the following matrix resulted: there is a high correlation between the freight value and the price, and the weakest correlation is between delivery time and review score. The company is therefore doing well when it comes to delivery times.
[Figure: Correlation matrix after data cleansing]
The matrix shows a high correlation between the total price and the freight value.
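The matrices were produced with the corrplot package; the following condensed call, taken from Appendix 1, shows how:
orders_all_1[is.na(orders_all_1)] <- 0 ##replace missing values before cor()##
cor_mat <- cor(orders_all_1) ##pairwise correlations of the numeric columns##
corrplot(cor_mat, method="color") ##blue = positive, red = negative correlation##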
3. Section 3: Prepare the Data
a) Derivation of attributes
The analysis was based on the correlations between the important variables: review score, freight value, product name length, photo quality, freight ratio, and total price. The R analysis also produced the top product categories. Overall, the company is evaluating cause and effect by finding out why customers rank some products higher than others.
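The derived fields come from Appendix 1 and are reproduced here for reference:
orders_all_2$total_price <- orders_all_2$price + orders_all_2$freight_value ##price plus freight##
orders_all_2$freight_ratio <- orders_all_2$freight_value/orders_all_2$price ##freight share of the price##
orders_all_2$purchase_day_of_week <- wday(orders_all_2$order_approved_at) ##lubridate::wday, 1 = Sunday by default##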
b) Discuss and justify what other steps you may have taken to prepare your data
In the raw data there may be repeated rows; such rows are redundant and had to be removed, together with columns that were not used in the analysis. The following R code was used to drop them:
drops <- c("order_status","order_delivered_carrier_date","order_purchase_timestamp",
"order_approved_at","customer_zip_code_prefix","customer_unique_id","customer_city",
"review_id","review_comment_title","review_comment_message","review_creation_date",
"review_answer_timestamp","order_item_id","seller_zip_code_prefix","seller_city",
"product_category_name","order_delivered_month","product_id","seller_id","order_id",
"customer_id","order_delivered_customer_date","order_estimated_delivery_date",
"customer_state","seller_state","payment_sequential","payment_value","del_time_bucket",
"payment_type","product_weight_g","product_height_cm","shipping_limit_date",
"product_length_cm","product_width_cm","payment_installments")
orders_all_1 <- orders_all[ , !(names(orders_all) %in% drops)]
The duplicated rows were dropped because a customer cannot post the same order twice for the same product at the same time. It is also important that dates be in a standard format and that date components such as day, year, and time be separated so each can be used in the computations (Field, 2009). For categorical columns, the mode of the column was used to fill gaps, and remaining missing records were dropped during the analysis. Since the research also poses a binary classification task, a new column containing the class labels had to be created. The columns used in the analysis also contain numerical data, and descriptive statistics over these columns are the easiest way to understand the data (de Jong, 2021).
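A minimal sketch of these steps follows; the impute_mode() helper is illustrative and not part of the original script, which instead replaced remaining missing values with zero:
orders_all$order_purchase_timestamp <- as.POSIXct(orders_all$order_purchase_timestamp, format="%Y-%m-%d %H:%M:%S", tz="GMT") ##standardise dates##
impute_mode <- function(x) { ##hypothetical helper: fill NAs with the most frequent level##
m <- names(which.max(table(x)))
x[is.na(x)] <- m
x
}
orders_all$payment_type <- impute_mode(orders_all$payment_type)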
c) "Training", "Validation" (if required), and "Testing" subsets of the dataset
The training and test data were prepared and divided for modeling as follows:
#####Low Med High####
orders_all_2$review_score_1 <- ifelse(orders_all_2$review_score < 3, 0, 1)
orders_all_2$review_score <- orders_all_2$review_score_1
####Next model####--Logistic model--####
set.seed(13)
train.index_1 <- createDataPartition(orders_all_2$review_score, p = 0.8, list = FALSE)
train_1.df <- orders_all_2[train.index_1, ]
valid_1.df <- orders_all_2[-train.index_1, ]
In this code, train_1.df is the training dataset, while valid_1.df serves as the testing dataset. The createDataPartition method from the caret package was used to partition the data, with the training set taking 80% and the test set the remaining 20%.
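A quick sanity check of the split sizes (the proportions are approximate because createDataPartition balances on the outcome):
nrow(train_1.df)/nrow(orders_all_2) ##about 0.80##
nrow(valid_1.df)/nrow(orders_all_2) ##about 0.20##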
4. Section 4: Build and Test Prediction Models
a) Configuration of the models
The models were configured (e.g., attributes and other model parameters were selected) to deliver the most relevant insights and the lowest error rates. The attributes included freight value, product photo quantity, total price, and the day of the week of the purchase. The models chosen were the simplest that could give insight into the data. They included a multiple linear regression model, used to compute the coefficients relating the dependent variable, the customer review score, to independent variables such as total price and purchase day.
b) The analysis used three models:
Linear model: the linear model was used to fit the review score against attributes such as freight value, product photo quantity, total price, and the day of purchase of the product. Another predictor used was the product description length. The following is the result of the multiple linear regression:
[Output: summary of the first multiple linear regression]
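The model was fitted with the lm() call from Appendix 1:
lm_1 <- lm(review_score~., data=train.df) ##regress the review score on all remaining columns##
summary(lm_1) ##coefficients, p-values and R-squared discussed below##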
This model estimated the relationship between the review score and freight value, product name length, product description length, product photo quantity, total price, freight ratio, and purchase day of the week. Negative coefficients were found for freight value, product name length, purchase day of week, and freight ratio. The model shows that the most significant predictor is total price, followed by product name length; picture quality also plays an important role.
From the output above, the R-squared value approaches 1, which shows that the model fits the data and that the information given by the linear model can be relied upon. The p-value of the intercept is less than alpha, so we reject the null hypothesis and conclude that the parameters are significant to the customer rating. Looking at the coefficients, the total price, the purchase day of the week, and the product name length are all predictors of the customer rating, and almost all the variables have similar significance for the customer score. The total price, however, is the most significant predictor of the customer review score.
The accuracy of the linear model was tested, with the following results:
[Output: accuracy metrics for the first linear model]
The p-value is effectively zero, which means the null hypothesis is rejected; there is therefore a significant correlation between the variables. The AIC and BIC values are large, however, which shows that the model does not fit all the values and cannot be fully relied upon. The data was then reconfigured, and calculated variables were used in the next model: time to deliver was calculated as the difference between the order time and the delivery time, the new variable total price was calculated as the price plus the freight value, and another calculated field, the freight ratio, was computed as the freight value divided by the order price. Other columns that were not used in the analysis, such as delivery month, delivery time, and estimated delivery time, were deleted from the dataset. The calculated fields were then used in the second linear model, which related the review score to the calculated fields such as time to deliver, freight ratio, and freight price. The result was as follows:
[Output: summary of the second linear regression on derived fields]
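The second model was fitted on the derived fields with the following calls from Appendix 1:
lm_2 <- lm(review_score~., data=train_1.df, na.action=na.omit)
summary(lm_2)
a_1 <- predict(lm_2, valid_1.df) ##predictions on the validation set##
accuracy(a_1, valid_1.df$review_score) ##error metrics via forecast::accuracy()##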
From the result, the intercept of 4.27 is statistically significant, and the total price, which has a positive correlation with the customer review, has the largest coefficient and is hence the most significant predictor. According to the model, it is followed by the product description length, which also has a positive correlation with the customer review. The variables with a positive correlation with the review include total price, product description length, and product name length.
The accuracy of the model is shown below:
[Output: accuracy metrics for the second linear model]
Judging by the reported metrics, this model is more accurate, since its error values are smaller than those of the first linear model. The adjusted R-squared for the model is 0.000223, sigma is 1.28, and the AIC is 25,606. The model scores better in accuracy than the previous linear model.
Logistic regression: logistic regression was also used to fit the data, and the output reported "Number of Fisher Scoring iterations: 1". The regression analysis shows that product photo quantity is the best predictor, followed by price. The AIC, however, is high, showing that the logistic regression does not fit the data as well as the linear regression above. The result of the logistic regression is shown below:
[Output: logistic regression summary]
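The model corresponds to the glm() call in Appendix 1, where family = binomial specifies the logit link:
logit.reg <- glm(review_score~., data=train_1.df, family=binomial)
summary(logit.reg)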
The model also confirms that the total price is the strongest predictor.
c) Ensemble Model
The ensemble model used is a 10-fold logistic regression with 3 iterations; it stacks the linear model and the generalized linear model, with 'svmRadial' as a further base learner:
algorithms_to_use <- c('glm', 'lm', 'svmRadial') ##'svmRadial' requires the kernlab package##
stacked_models <- caretList(review_score ~., data=orders_all_1,
trControl=control_stacking, methodList=algorithms_to_use) ##caretList() is from caretEnsemble##
stacking_results <- resamples(stacked_models)
The scoring rule used to measure the accuracy of the model is the AIC score; from the results, the ensemble had the lowest AIC. The ensemble model returned better scores and showed that price was the most significant predictor.
d) Improving the models
To improve the ensemble algorithm, weighted k-nearest neighbors was applied to the ensemble fit, and performance was best at k = 3. Despite the improvement, the ensemble was not a good classifier for the data. Initially the model was set to k = 2; refitting the kNN with k = 3 improved the model, as shown in the code section. The score for the ensemble was 0.856, which is low compared to the other classifiers, and the count of false positives is higher than the count of false negatives, which is not good for classification (Menard, 2000). The code below was used to improve the model:
stackControl <- trainControl(method="repeatedcv", number=3, repeats=3,
savePredictions=TRUE, classProbs=TRUE)
As can be seen in the code above, the number of cross-validation folds and repeats was increased to 3 from the previous 2 to improve the model accuracy; note that this tunes the resampling, not the k of the nearest-neighbor model, which is set separately (see the sketch below).
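A hedged sketch of the weighted k-NN fit with k = 3 follows; the "kknn" method and its tuning grid (kmax, distance, kernel) come from caret's wrapper around the kknn package and are assumptions here, since the original script does not include this step:
train_1.df$review_score <- as.factor(train_1.df$review_score) ##classification outcome##
knn_fit <- train(review_score~., data=train_1.df, method="kknn",
tuneGrid=data.frame(kmax=3, distance=2, kernel="optimal"),
trControl=trainControl(method="cv", number=3))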
Additionally, logistic regression has an inverse regularization constant whose candidate values are powers of 10; the model performs well when this C value is 10e-2. After validation, the F1 score for the classifier is 0.89, which is better than the ensemble, although the count of true negatives is lower than the count of false positives. The last of the algorithms is the random forest classifier. Hyperparameter tuning was applied to the count of base estimators, which in this case are decision trees: six values were chosen between 50 and 210, limits inclusive, as the number of estimators. The validation score is 0.89, which is good; a sketch of this search follows below.
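A sketch of the estimator-count search described above, assuming the randomForest setup from Appendix 1 (the grid of six values from 50 to 210 is taken from the text):
for (n_trees in seq(50, 210, length.out=6)) { ##50, 82, 114, 146, 178, 210##
rf <- randomForest(review_score~., data=train_1.df, ntree=n_trees)
pred <- predict(rf, valid_1.df)
cat(n_trees, "trees -> accuracy:", mean(pred == valid_1.df$review_score), "\n")
}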
e) Improving the algorithms using XGBoost
The hyperparameters used in this case again concern the decision trees: the values are chosen between 50 and 210, and the number of estimators is limited as mentioned for the random forest. The score for XGBoost is 0.89. Another boosting option is the multi-layer perceptron: using multi-layer perceptrons for classification, with the Adam optimizer and a learning rate of 1e-3, the score is 0.89, which is better than the scores of the ensemble and the other classifiers above.
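A hedged sketch of the XGBoost fit follows; the xgboost package is not used in Appendix 1, so the call below is an assumption (predictors must be a numeric matrix and the label a 0/1 vector):
library(xgboost)
X_train <- as.matrix(train_1.df[, setdiff(names(train_1.df), "review_score")])
y_train <- as.numeric(as.character(train_1.df$review_score))
bst <- xgboost(data=X_train, label=y_train, nrounds=210, objective="binary:logistic", verbose=0) ##nrounds plays the role of the estimator count##
pred <- predict(bst, as.matrix(valid_1.df[, setdiff(names(valid_1.df), "review_score")]))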
Ensemble and logistic regression are the best models for predicting customer satisfaction ratings.
5. Section 5: Problem Conclusions and Recommendations
Many factors contribute to the customer product review score, and the business needs to consider each. Businesses need to incorporate the latest technological advances into their services to improve the chances of repeat customers, since good ratings raise the probability of recommendations. From the study, there are many indicators of the repeat customer; the most important are the picture quality and the length of the product title, which rank high in the regression models (Parsons, Costa, Achten & Stallard, 2009). It is therefore important to leverage current photo-processing technology to ensure the highest image quality, which in turn supports good customer reviews. Brands need to make strategic decisions about product photography, target low-traffic cities for growth opportunities, and create special landing pages for the key categories that aggregate the most sales. Other indicators, such as product quantity, total price, and purchase day of the week, also matter for customer ratings, which means the company should take advantage of days like Black Friday to gain more customers.
References
Parsons, N. R., Costa, M.
L., Achten, J., & Stallard, N. (2009). Repeated measures proportional odds
logistic regression analysis of ordinal score data in the statistical software
package R. Computational Statistics & Data Analysis, 53(3),
632-641.
Souza, F. (2021). Sentiment
Analysis on Brazilian Portuguese User Reviews. arXiv preprint
arXiv:2112.05459.
Verma, A., Kuo, Y. H.,
Kumar, M. M., Pratap, S., & Chen, V. (2022). A data analytic-based
logistics modelling framework for E-commerce enterprise. Enterprise
Information Systems, 1-23.
Facundo, J. G. (2021).
Customer Churn-Prevention Model–Unsupervised Classification.
Patel, M. (2020). Exploratory
Data Analysis and Sentiment Analysis on Brazilian E-Commerce Website (Doctoral
dissertation).
Matuszelański, K., &
Kopczewska, K. (2022). Customer Churn in Retail E-Commerce Business: Spatial
and Machine Learning Approach. Journal of Theoretical and Applied
Electronic Commerce Research, 17(1), 165-198.
de Jong, T. (2021).
Brazilian Marketplace: How Does the Customer Experience Influence the Review
Valence?.
Klimantaviciute, G. (2021).
Customer Churn Prediction in E-Commerce Industry. Journal of Machine
Learning Research, 1, 1-14.
Wong, A. N., &
Marikannan, B. P. (2020, December). Optimising e-commerce customer satisfaction
with machine learning. In Journal of Physics: Conference Series (Vol.
1712, No. 1, p. 012044). IOP Publishing.
Eldik, M. V. (2021). The
Relationship of Geodemographics and Written Online Customer Review Creation:
Insights from Brazil (Doctoral dissertation).
Al-Basha, F. (2021). Forecasting
Retail Sales Using Google Trends and Machine Learning (Doctoral dissertation,
HEC Montréal).
Menard, S. (2000).
Coefficients of determination for multiple logistic regression analysis. The
American Statistician, 54(1), 17-24.
Field, A. (2009). Logistic
regression. Discovering statistics using SPSS, 264,
315.
Kwak, C., & Clayton-Matthews,
A. (2002). Multinomial logistic regression. Nursing Research, 51(6),
404-410.
Appendix 1: R-CODE
library(dplyr)
library(forecast)
library(leaps)
library(tidyverse)
library(caret)
library(corrplot)
library(nnet)
library(MASS)
library(randomForest)
library(lubridate)
library(pROC)
library(gains)
library(readr)
library(caretEnsemble) ##provides caretList() and caretStack() used below##
####Loading datasets####
orders <-
read.csv("C:/data1/olist/olist_orders_dataset.csv",header =
T,stringsAsFactors = F)
customers <-
read.csv("C:/data1/olist/olist_customers_dataset.csv",header = T)
order_reviews <-
read.csv("C:/data1/olist/olist_order_reviews_dataset.csv",header = T)
order_payments <-
read.csv("C:/data1/olist/olist_order_payments_dataset.csv",header =
T)
order_items_details <-
read.csv("C:/data1/olist/olist_order_items_dataset.csv",header = T)
sellers <-
read.csv("C:/data1/olist/olist_sellers_dataset.csv",header = T)
geolocation <-
read.csv("C:/data1/olist/olist_geolocation_dataset.csv",header = T)
products <-
read.csv("C:/data1/olist/olist_products_dataset.csv",header = T)
nrow(orders)
####Data preparation####
###Merging datasets###
orders <- orders[orders$order_status=="delivered",]
orders_all <- merge(orders,customers,by="customer_id",all.x=T)
orders_all$order_purchase_timestamp <- as.POSIXct(orders_all$order_purchase_timestamp,"%Y-%m-%d %H:%M:%S",tz="GMT")
orders_all$order_delivered_customer_date <- as.POSIXct(orders_all$order_delivered_customer_date,"%Y-%m-%d %H:%M:%S",tz="GMT")
orders_all$order_approved_at <- as.POSIXct(orders_all$order_approved_at,"%Y-%m-%d %H:%M:%S",tz="GMT")
orders_all$order_estimated_delivery_date <- as.POSIXct(orders_all$order_estimated_delivery_date,"%Y-%m-%d %H:%M:%S",tz="GMT")
orders_all <-
merge(orders_all,order_reviews,by="order_id")
orders_all <- orders_all[!duplicated(orders_all$order_id),]
orders_all <-
merge(orders_all,order_payments,by="order_id",all.x = T)
orders_all <-
orders_all[!duplicated(orders_all$order_id),]
orders_all <- merge(orders_all,order_items_details,by="order_id",all.x=T)
orders_all <- orders_all[!duplicated(orders_all$order_id),]
orders_all <- merge(orders_all,sellers,by="seller_id",all.x=T)
orders_all <- merge(orders_all,products,by="product_id",all.x=T)
rm(list=setdiff(ls(), "orders_all"))
setwd("C:/data1/olist/")
write.csv(orders_all,file="Orders_merged.csv",row.names
= F)
head(orders_all)
####Loading the combined dataset####
setwd("C:/data1/olist/")
orders_all <- read.csv("Orders_merged.csv",header = T,stringsAsFactors = F)
orders_all$order_purchase_timestamp <- as.POSIXct(orders_all$order_purchase_timestamp,"%Y-%m-%d %H:%M:%S",tz="GMT")
orders_all$order_delivered_customer_date <- as.POSIXct(orders_all$order_delivered_customer_date,"%Y-%m-%d %H:%M:%S",tz="GMT")
orders_all$order_approved_at <- as.POSIXct(orders_all$order_approved_at,"%Y-%m-%d %H:%M:%S",tz="GMT")
orders_all$order_estimated_delivery_date <-
#as.POSIXct(orders_all$order_estimated_delivery_date,"%Y-%m-%d",tz="GMT")
as.Date(orders_all$order_estimated_delivery_date, format = "%Y-%m-%d")
orders_all$order_delivered_month <- format(as.Date(orders_all$order_estimated_delivery_date),"%Y-%m")
orders_all$del_time <-
difftime(orders_all$order_delivered_customer_date,
orders_all$order_purchase_timestamp,
units="days")
orders_all <- na.omit(orders_all) ##drop rows with missing values##
orders_all$del_time_bucket <- ifelse(orders_all$del_time
< 5,"<5",
ifelse(orders_all$del_time<10,"5-10",
ifelse(orders_all$del_time < 20,"10-20",
ifelse(orders_all$del_time<40,"20-40",">40"))))
nrow(orders_all)
####Summarisation and data exploration####
no_of_orders_month <- orders_all %>%
group_by(order_delivered_month) %>%
summarise(no=n(),rev_in_K=sum(price)/1000,del_time=mean(del_time))
no_of_orders_location <- orders_all %>%
group_by(customer_state) %>%
summarise(no=n(),rev_in_K=sum(price)/1000,del_time=mean(del_time))
no_of_orders_location_month <- orders_all %>%
group_by(customer_state,order_delivered_month) %>%
summarise(no=n(),rev_in_K=sum(price)/1000,del_time=mean(del_time))
no_of_orders_rs <-orders_all %>%
group_by(review_score,order_delivered_month) %>%
summarise(no=n())
no_of_orders_rs_p <-orders_all %>%
group_by(review_score,product_category_name) %>%
summarise(no=n())
no_of_orders_pt <- orders_all %>%
group_by(payment_type) %>%
summarise(no=n(),rev_in_K=sum(price)/1000,del_time=mean(del_time))
no_of_orders_seller <- orders_all %>%
group_by(seller_id) %>%
summarise(no=n(),rev_in_K=sum(price)/1000,del_time=mean(del_time))
no_of_orders_seller <-
no_of_orders_seller[order(-no_of_orders_seller$rev_in_K),]
no_of_orders_seller_1 <-
no_of_orders_seller[order(-no_of_orders_seller$no),]
no_of_orders_photos <- orders_all %>% group_by(product_photos_qty) %>%
summarise(no=n(),rev_in_K=sum(price)/1000,del_time=mean(del_time))
no_of_orders_pcm <- orders_all %>%
group_by(product_category_name) %>%
summarise(no=n(),rev_in_K=sum(price)/1000,del_time=mean(del_time))
no_of_orders_nl <- orders_all %>%
group_by(product_name_lenght) %>%
summarise(no=n(),rev_in_K=sum(price)/1000,del_time=mean(del_time))
####Data preparation for prediction####
###Selecting variables and dropping columns###
drops <-
c("order_status","order_delivered_carrier_date","order_purchase_timestamp",
"order_approved_at","customer_zip_code_prefix","customer_unique_id",
"customer_city","review_id","review_comment_title","review_comment_message",
"review_creation_date","review_answer_timestamp","order_item_id","seller_zip_code_prefix",
"seller_city","product_category_name","order_delivered_month","product_id","seller_id",
"order_id","customer_id","order_delivered_customer_date","order_estimated_delivery_date",
"customer_state","seller_state","payment_sequential","payment_value","del_time_bucket","payment_type","product_weight_g",
"product_height_cm",
"shipping_limit_date","product_length_cm","product_width_cm","payment_installments")
orders_all_1 <- orders_all[ , !(names(orders_all) %in%
drops)]
head(orders_all_1)
#####Creating partition sets####
orders_all_1$del_time <- as.numeric(orders_all_1$del_time)
set.seed(13)
train.index <- createDataPartition(orders_all_1$review_score,
p = 0.8, list = FALSE)
train.df <- orders_all_1[train.index, ]
valid.df <- orders_all_1[-train.index, ]
orders_all_1[is.na(orders_all_1)] <- 0
cor_mat <- cor(orders_all_1)
corrplot(cor_mat, method="color")
###Red indicates negative correlation and blue positive correlation;
###the intensity of the colour gives the strength of the correlation##
##We decide that none of the variables are strongly correlated##
####Linear regression####
lm_1 <- lm(review_score~.,data=train.df)
summary(lm_1)
##We did not use the correlation matrix because there were categorical
##variables, and the correlation matrix causes multicollinearity##
####Reg search####
search <- regsubsets(review_score ~ ., data = train.df,
nbest = 1, nvmax = dim(orders_all_1)[2],
method = "exhaustive")
sum <- summary(search) ##Shows that 7 predictors give the best model##
a <- predict(lm_1,valid.df) ###Predicting on the validation dataset##
accuracy(a, valid.df$review_score) ####RMSE too high, not so good a model
#####Since the direct variables were weakly correlated, we try some derived
#####variables and rerun to see if there is an improvement####
orders_all_2 <- orders_all
head(orders_all_2)
orders_all_2$est_del_time<-
difftime(orders_all$order_estimated_delivery_date,
orders_all$order_approved_at,
units="days")
orders_all_2$delta_time<-
orders_all_2$est_del_time-orders_all_2$del_time
orders_all_2$Late <-
ifelse(orders_all_2$delta_time<0,1,0)
orders_all_2$total_price <-
orders_all_2$price+orders_all_2$freight_value
orders_all_2$freight_ratio <-
orders_all_2$freight_value/orders_all_2$price
orders_all_2$purchase_day_of_week <-
wday(orders_all_2$order_approved_at)
orders_all_2 <- orders_all_2[ , !(names(orders_all_2)
%in% drops)]
orders_all_2$del_time <-
as.numeric(orders_all_2$del_time)
orders_all_2$est_del_time <-
as.numeric(orders_all_2$est_del_time)
orders_all_2$delta_time <-
as.numeric(orders_all_2$delta_time)
orders_all_2$order_delivered_month <- NULL
orders_all_2$del_time <- NULL
orders_all_2$delta_time <- NULL
orders_all_2$Late <- NULL
orders_all_2$est_del_time <- NULL
orders_all_2[is.na(orders_all_2)] <- 0
orders_all_2 <- orders_all_2[ ,
-which(names(orders_all_2) %in% c("price"))]
cor_mat_2 <- cor(orders_all_2)
str(orders_all_2)
corrplot(cor_mat_2, method="color")
#####Partition the dataset######## Using the new derived metrics####
set.seed(13)
train.index_1 <-
createDataPartition(orders_all_2$review_score, p = 0.8, list = FALSE)
train_1.df <- orders_all_2[train.index_1, ]
valid_1.df <- orders_all_2[-train.index_1, ]
train_1.df$review_score <- as.numeric(train_1.df$review_score)
lm_2 <- lm(review_score~.,data=train_1.df,na.action =
na.omit)
summary(lm_2)
a_1 <- predict(lm_2,valid_1.df) ###Predicting on the validation dataset##
accuracy(a_1, valid_1.df$review_score) ####RMSE too high, not so good a model###
str(valid_1.df)
str(valid_1.df)
#####Low Med High####
orders_all_2$review_score_1 <-
ifelse(orders_all_2$review_score<3,0,1)
orders_all_2$review_score <-
orders_all_2$review_score_1
####Next model####--Logistic model--####
set.seed(13)
train.index_1 <-
createDataPartition(orders_all_2$review_score, p = 0.8, list = FALSE)
train_1.df <- orders_all_2[train.index_1, ]
valid_1.df <- orders_all_2[-train.index_1, ]
orders_all_2$review_score <- as.factor(orders_all_2$review_score) ###Levelling by giving a reference###
logit.reg <- glm(review_score~., data = train_1.df, family = binomial) ##binomial family gives the logit model##
summary(logit.reg)
###Fitting the model and checking accuracy using a confusion matrix##
logit.reg.pred <- as.data.frame(predict(logit.reg,
valid_1.df[, -1], type = "response"))
colnames(logit.reg.pred)[1] <- "p"
logit.reg.pred$class <-
ifelse(logit.reg.pred$p>0.5,1,0)
cm <- table(logit.reg.pred$class,valid_1.df$review_score)
caret::confusionMatrix(cm)
########Next model -linear discriminant analysis#####
#Running linear discriminant analysis
lda <- lda(review_score~.,data=train_1.df)
x= "review_score"
lda <- lda(as.formula(paste(x, ".", sep =
"~")), train_1.df)
###Checking model on test dataset##
str(valid_1.df)
pred_lda <- predict(lda, valid_1.df)
###Checking accuracy using a confusion matrix## Accuracy is again found to be 80%##
cm_lda <- (table(pred_lda$class,
valid_1.df$review_score))
caret::confusionMatrix(cm_lda)
####Next model-random forest ####
##First convert the dependent variable to factors##
valid_1.df$review_score <-
as.factor(valid_1.df$review_score)
train_1.df$review_score <-
as.factor(train_1.df$review_score)
##Running randomForest on training##
rfm <- randomForest(review_score~., train_1.df)
###Predicting on the validation dataset and checking for accuracy###---90%
pred_rfm <-(predict(rfm,valid_1.df))
pred_rfm_p <- as.data.frame(predict(rfm, valid_1.df, type
= "prob"))
cm_rfm <- (table(pred_rfm, valid_1.df$review_score))
caret::confusionMatrix(cm_rfm)
####Out of all the models, random forest gives us the best result, so we choose it to explain the model####
##Plotting lift####
valid_1.df$review_score <-
as.numeric(valid_1.df$review_score)
p <- pred_lda$posterior[,2]
gain <- gains(valid_1.df$review_score,p)
plot(c(0,gain$cume.pct.of.total*sum(valid_1.df$review_score))~c(0,gain$cume.obs),
xlab = "# cases", ylab = "classified", main = "Lift-chart", type = "l")
lines(c(0,sum(valid_1.df$review_score))~c(0,
dim(valid_1.df)[1]), lty = 5,col="blue1")
###Plotting roc curve for random forest###
r <- roc(valid_1.df$review_score,pred_rfm_p$`1`)
plot.roc(r)
auc(r)
## ensemble models
## stacking the models
set.seed(100)
control_stacking <- trainControl(method="repeatedcv", number=2, repeats=2,
savePredictions=TRUE, classProbs=TRUE)
algorithms_to_use <- c('lm', 'glm') ##model names must be unique in methodList##
stacked_models <- caretList(review_score ~., data=orders_all_1,
trControl=control_stacking, methodList=algorithms_to_use)
stacking_results <- resamples(stacked_models)
summary(stacking_results)
#combine the predictions
# stack using ensemble
stackControl <- trainControl(method="repeatedcv", number=3, repeats=3,
savePredictions=TRUE, classProbs=TRUE)
set.seed(234)
glm_stack <- caretStack(stacked_models, method="lm", metric="RMSE",
trControl=stackControl) ##RMSE metric since review_score is numeric here##
print(glm_stack)