Project 3 Quantium Retail Strategy and Analytics

Code: GitHub

3.1 Introduction

This work is part of the virtual internship program of the company Quantium.

Quantium is a Data Science company that helps companies with insights and models looking to improve their performance.

Quantium has had a data partnership with a major supermarket brand in recent years, which provides transactional and customer data.

The analysis is based on Quantium chip data to better understand the types of customers who buy chips and their buying behavior in the region.

The analysis insights serve to feed the supermarket’s strategic plan for the chip category.

3.2 Data

The first step in any analysis is to first understand the data. Let’s take a look at each of the datasets provided.

3.3 Exploring the Data

Starting with the analysis of the dataset transaction, let’s look at the first few lines.

3.3.1 Examining transaction data

DATE STORE LYLTY_CARD TXN_ID PROD NAME QTY TOT_SALES
43390 1 1000 1 5 Natural Chip Compny SeaSalt175g 2 6
43599 1 1307 348 66 CCs Nacho Cheese 175g 3 6.3
43605 1 1343 383 61 Smiths Crinkle Cut Chips Chicken 170g 2 2.9
43329 2 2373 974 69 Smiths Chip Thinly S/Cream&Onion 175g 5 15
43330 2 2426 1038 108 Kettle Tortilla ChpsHny&Jlpno Chili 150g 3 13.8

As we are only interested in words that tell us if the product is chips or not, let’s remove all words with digits and special characters such as ‘&’ from our set of product words.

There are salsa products in the dataset but we are only interested in the chips category, so let’s remove these.

Next, we can check summary statistics such as mean, min and max values for each feature to see if there are any obvious outliers in the data and if there are any nulls in any of the columns

STORE QTY TOT_SALES PROD_SIZE
count 246740 246740 246740 246740
mean 135.05 1.90646 7.31611 175.584
std 76.787 0.342499 2.4749 59.4321
min 1 1 1.7 70
25% 70 2 5.8 150
50% 130 2 7.4 170
75% 203 2 8.8 175
max 272 5 29.5 380

There are no nulls in the columns but product quantity appears to have an outlier which we should investigate further. Let’s investigate further the case where 200 packets of chips are bought in one transactions.

DATE STORE LYLTY_CARD TXN_ID PROD PROD_NAME QTY TOT_SALES PROD_SIZE
2018-08-19 226 226000 226201 4 dorito corn chp supreme 200 650 380
2019-05-20 226 226000 226210 4 dorito corn chp supreme 200 650 380

There are two transactions where 200 packets of chips are bought in one transaction and both of these transactions were by the same customer. Let’s see if the customer has had other transactions

DATE STORE LYLTY_CARD TXN_ID PROD NAME PROD_QTY TOT_SALES PROD_SIZE
2018-08-19 226 226000 226201 4 dorito corn chp supreme 200 650 380
2019-05-20 226 226000 226210 4 dorito corn chp supreme 200 650 380

It looks like this customer has only had the two transactions over the year and is not an ordinary retail customer. The customer might be buying chips for commercial purposes instead. We remove this loyalty card number from further analysis.

STORE QTY TOT_SALES PROD_SIZE
count 246740 246740 246740 246740
mean 135.05 1.90646 7.31611 175.584
std 76.787 0.342499 2.4749 59.4321
min 1 1 1.7 70
25% 70 2 5.8 150
50% 130 2 7.4 170
75% 203 2 8.8 175
max 272 5 29.5 380

That’s better. Now,looking at the number of transaction lines over time to see if there are any obvious data issues such as missing data, we see that is missing a value. There’s only 364 rows, meaning only 364 dates which indicates a missing date.

figure 1

Zooming in on the data to take a closer look:

We can see that the increase in sales occurs in the lead-up to Christmas and that there are zero sales on Christmas day itself. This is due to shops being closed on Christmas day.

Now that we are satisfied that the data no longer has outliers, we can move on to creating other features such as brand of chips or pack size from PROD_NAME. We start with pack size.

The largest size is 380g and the smallest size is 70g - seems sensible!

Let’s plot a histogram of PACK_SIZE since we know that it is a categorical variable and not a continuous variable even though it is numeric.

figure3

3.3.2 Examining customer data

Now that we are happy with the transaction dataset, let’s have a look at the customer dataset.

LYLTY_CARD_NBR LIFESTAGE PREMIUM_CUSTOMER
1000 YOUNG SINGLES/COUPLES Premium
1002 YOUNG SINGLES/COUPLES Mainstream
1003 YOUNG FAMILIES Budget
1004 OLDER SINGLES/COUPLES Mainstream
1005 MIDAGE SINGLES/COUPLES Mainstream

We will now join the two data using the python merge function.

DATE LYLTY_CARD_NBR TXN_ID PROD_NBR PROD_NAME QTY TOT_SALES SIZE LIFESTAGE PREMIUM_CUSTOMER
2018-10-17 1000 1 5 natural chip compny seasalt 2 6 175 YOUNG SINGLES/COUPLES Premium
2019-05-14 1307 348 66 ccs nacho cheese 3 6.3 175 MIDAGE SINGLES/COUPLES Budget
2019-05-20 1343 383 61 smiths crinkle cut chips chicken 2 2.9 170 MIDAGE SINGLES/COUPLES Budget
2018-08-17 2373 974 69 smiths chip thinly cream onion 5 15 175 MIDAGE SINGLES/COUPLES Budget
2018-08-18 2426 1038 108 kettle tortilla chpshny jlpno chili 3 13.8 150 MIDAGE SINGLES/COUPLES Budget

As the number of rows in ‘result’ is the same as that of ‘transactionData’, we can be sure that no duplicates were created. This is because we created ‘result’ by setting a left join which means take all the rows in ‘transactionData’ and find rows with matching values in shared columns and then join the details in these rows to the ‘x’ or the first mentioned table.

3.4 Data analysis on customer segments

Now that the data is ready for analysis, we can define some metrics of interest to the client: - Who spends the most on chips (total sales), describing customers by lifestage and how premium their general purchasing behaviour is - How many customers are in each segment - How many chips are bought per customer by segment - What’s the average chip price by customer segment

Let’s start with calculating total sales by LIFESTAGE and PREMIUM_CUSTOMER and plotting the split by these segments to describe which customer segment contributes most to chip sales.

                                          TOT_SALES
PREMIUM_CUSTOMER LIFESTAGE                        
Budget           MIDAGE SINGLES/COUPLES   33345.70
                 NEW FAMILIES             20607.45
                 OLDER FAMILIES          156863.75
                 OLDER SINGLES/COUPLES   127833.60
                 RETIREES                105916.30
                 YOUNG FAMILIES          129717.95
                 YOUNG SINGLES/COUPLES    57122.10
Mainstream       MIDAGE SINGLES/COUPLES   84734.25
                 NEW FAMILIES             15979.70
                 OLDER FAMILIES           96413.55
                 OLDER SINGLES/COUPLES   124648.50
                 RETIREES                145168.95
                 YOUNG FAMILIES           86338.25
                 YOUNG SINGLES/COUPLES   147582.20
Premium          MIDAGE SINGLES/COUPLES   54443.85
                 NEW FAMILIES             10760.80
                 OLDER FAMILIES           75242.60
                 OLDER SINGLES/COUPLES   123537.55
                 RETIREES                 91296.65
                 YOUNG FAMILIES           78571.70
                 YOUNG SINGLES/COUPLES    39052.30

Plotting the bar graph of the data, we have:

Sales are coming mainly from Budget - older families, Mainstream - young singles/couples, and Mainstream - retirees

Let’s see if the higher sales are due to there being more customers who buy chips.

                                         TOT_SALES
PREMIUM_CUSTOMER LIFESTAGE                        
Budget           MIDAGE SINGLES/COUPLES   35514.80
                 NEW FAMILIES             21928.45
                 OLDER FAMILIES          168363.25
                 OLDER SINGLES/COUPLES   136769.80
                 RETIREES                113147.80
                 YOUNG FAMILIES          139345.85
                 YOUNG SINGLES/COUPLES    61141.60
Mainstream       MIDAGE SINGLES/COUPLES   90803.85
                 NEW FAMILIES             17013.90
                 OLDER FAMILIES          103445.55
                 OLDER SINGLES/COUPLES   133393.80
                 RETIREES                155677.05
                 YOUNG FAMILIES           92788.75
                 YOUNG SINGLES/COUPLES   157621.60
Premium          MIDAGE SINGLES/COUPLES   58432.65
                 NEW FAMILIES             11491.10
                 OLDER FAMILIES           81958.40
                 OLDER SINGLES/COUPLES   132263.15
                 RETIREES                 97646.05
                 YOUNG FAMILIES           84025.50
                 YOUNG SINGLES/COUPLES    41642.10

Plotting again:

There are more Mainstream - young singles/couples and Mainstream - retirees who buy chips. This contributes to there being more sales to these customer segments but this is not a major driver for the Budget - Older families segment.

Higher sales may also be driven by more units of chips being bought per customer.

Let’s have a look at this next.

Mainstream mid aged and young singles and couples are more willing to pay more per packet of chips compared to their budget and premium counterparts. This may be due to premium shoppers being more likely to buy healthy snacks and when they buy chips, this is mainly for entertainment purposes rather than their own consumption. This is also supported by there being fewer premium mid aged and young singles and couples buying chips compared to their mainstream counterparts.

As the difference in average price per unit isn’t large, we can check if this difference is statistically different.

The t-test results < 2.2e-16, i.e. the unit price for mainstream, young and mid-age singles and couples are significantly higher than that of budget or premium, young and mid age singles and couples.

3.5 Deep dive into specific customer segments for insights

We have found quite a few interesting insights that we can dive deeper into. We might want to target customer segments that contribute the most to sales to retain them or further increase sales. Let’s look at Mainstream - young singles/couples. For instance, let’s find out if they tend to buy a particular brand of chips.

MIDAGE SINGLES/COUPLES
smiths crinkle chips salt   vinegar     194
cheezels cheese                         186
doritos corn chips  nacho cheese        179
kettle chilli                           179
cobs popd sour crm   chives chips       176

YOUNG SINGLES/COUPLES
tostitos splash of  lime                335
kettle mozzarella   basil   pesto       332
doritos corn chips  cheese supreme      326
smiths crnkle chip  orgnl big bag       323
kettle tortilla chpshny jlpno chili     323

We can see that : - Mainstream young singles/couples are 23% more likely to purchase Tyrrells chips compared to the rest of the population - Mainstream young singles/couples are 56% less likely to purchase Burger Rings compared to the rest of the population

Let’s also find out if our target segment tends to buy large packs of chips.

MIDAGE SINGLES/COUPLES
175.0    2975
150.0    1777
134.0    1159
110.0    1124
170.0     882

MIDAGE SINGLES/COUPLES
175.0    2975
150.0    1777
134.0    1159
110.0    1124
170.0     882

Both the segment buy 175g, 150g and 134g packets mostly

3.6 Conclusion

Sales have mainly been due to Budget - older families, Mainstream - young singles/couples, and Mainstream - retirees shoppers. We found that the high spend in chips for mainstream young singles/couples and retirees is due to there being more of them than other buyers. Mainstream, midage and young singles and couples are also more likely to pay more per packet of chips. This is indicative of impulse buying behavior.

We’ve also found that Mainstream young singles and couples are 23% more likely to purchase Tyrrells chips compared to the rest of the population. The Category Manager may want to increase the category’s performance by off-locating some Tyrrells and smaller packs of chips in discretionary space near segments where young singles and couples frequent more often to increase visibility and impulse behavior.

Quantium can help the Category Manager with recommendations of where these segments are and further help them with measuring the impact of the changed placement. We’ll work on measuring the impact of trials in the next task and putting all these together in the third task.