Introduction

Here it is! My first real-world experience with data analysis. Up to this point, everything I’ve done involving data analysis has been in a guided, educational context. My previous experience has been helpful in providing me with an outline of how to proceed, but in order to fully understand the process, hands-on practice on an authentic data set is essential. So I spoke with my employer about gaining access to the company’s sales records with the goal of building a model to minimize loss incurred by wasted product. He granted me access to all the sales records since the company was founded in 2012!

Background: The data comes from Local Lion, a coffee and donut shop in Boone, NC where I’ve worked since 2013. The resident population of Boone is matched (if not surpassed) by the student population of local Appalachian State University. As a result, students constitute a proportionally large percentage of our client base; I suspect that their absence during the summer has a significant impact on our sales. This notion is fundamental to my goal: to build a model to predict the number of donuts we need to make each morning based on a variety of external factors from weather to student presence. (Note: We sell a variety of different types of donuts, but for this I focus primarily on our yeast donuts – these are by and large our best-selling donut.)

Let’s dive in!

Loading and Tidying data

When I initially surveyed the data, I found that it was saved in annual sales reports. I filtered the reports to only grab the sales relevant to our yeast donuts and found that each file needed to be transposed before proceeding in R. I transposed each file in excel and combined them into one .csv so I wouldn’t have to repeat the read-in code for each file.

donuts <- read.csv("./tidydonuts.csv")
glimpse(donuts)
Observations: 2,133
Variables: 4
$ Date      <fctr> 2/28/2013, 3/1/2013, 3/2/2013, 3/3/2013, 3/4/2013, ...
$ Single    <fctr> $16.40 , $9.84 , $0.00 , $0.00 , $0.00 , $0.00 , $1...
$ Dozen     <fctr> $0.00 , $0.00 , $0.00 , $0.00 , $0.00 , $0.00 , $0....
$ HalfDozen <fctr> $0.00 , $0.00 , $0.00 , $0.00 , $0.00 , $0.00 , $0....

Ah, the .csv file imported each variable as a factor. Let me reclassify the data.

donuts$Date <- as.Date(donuts$Date, format = "%m/%d/%Y")
donuts$Dozen <- as.numeric(sub('\\$','',as.character(donuts$Dozen)))
donuts$Single <- as.numeric(sub('\\$','',as.character(donuts$Single)))
donuts$HalfDozen <- as.numeric(sub('\\$','',as.character(donuts$HalfDozen)))
glimpse(donuts)
Observations: 2,133
Variables: 4
$ Date      <date> 2013-02-28, 2013-03-01, 2013-03-02, 2013-03-03, 201...
$ Single    <dbl> 16.40, 9.84, 0.00, 0.00, 0.00, 0.00, 18.04, 47.56, 1...
$ Dozen     <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 44.2...
$ HalfDozen <dbl> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 8.2, 8.2, 0....

Now, as you can see, we have 3 menu item relevant to yeast donuts: Single, Dozen and HalfDozen. These each report the total sales regarding each menu item. I’m mainly interested in the total number of yeast donuts sold daily, so let’s create a variable that reports that.

donuts <- donuts %>% 
  mutate(Total = as.integer(((Dozen/14.75)*12) + (Single/1.64) + ((HalfDozen/8.2)*6)))
glimpse(donuts)
Observations: 2,133
Variables: 5
$ Date      <date> 2013-02-28, 2013-03-01, 2013-03-02, 2013-03-03, 201...
$ Single    <dbl> 16.40, 9.84, 0.00, 0.00, 0.00, 0.00, 18.04, 47.56, 1...
$ Dozen     <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 44.2...
$ HalfDozen <dbl> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 8.2, 8.2, 0....
$ Total     <int> 10, 6, 0, 0, 0, 0, 11, 29, 131, 95, 0, 77, 93, 87, 7...

Exploratory Data Analysis

Now that the data set is in order and ready for action, I want to take a little time to answer a few questions that I’ve always been curious about. This will also help me get to know the data set better. For example, what’s our high score?

        Date Single  Dozen HalfDozen Total
1 2018-06-01  265.5 637.96     254.2   866

Wow, almost 1,000 yeast donuts in one day! Guess what June 1st is…National Donut Day.
What were the other highest values?

        Date Single  Dozen HalfDozen Total
1 2018-06-01 265.50 637.96     254.2   866
2 2017-01-27 206.50 541.20     155.8   680
3 2017-05-12 132.75 536.28     196.8   661
4 2018-08-17 250.75 477.24     164.0   661
5 2018-10-19 265.50 465.76     147.6   648
6 2015-06-05 368.75 401.80     131.2   647

Now, another big question…what’s the total number of yeast donuts we’ve sold?

  Total Sold Total Sales
1     458085    612554.1

Wow! Almost half a million donuts. That value doesn’t account for the 5 dozen we send daily to Appalachian State University (or our lax data entry), so the real figure’s somewhat higher.

Data Visualization and Investigation

This scatterplot reflects the number of donuts sold each day. The horizontal line at the bottom initially struck me as peculiar, but I quickly realized that’s due to the fact that we’re closed on Sundays. There’s also a marked increase in sales in mid-late 2014. After mentioning it to the people at Local Lion, we determined that this shift coincides with an intensive remodel that occured around that time. I guess it paid off!

Now, ultimately I’d like to make a model that predicts how many yeast donuts we should make on a given day considering a number of external factors (weather, student presence, etc.). So let’s begin by adding some additional data to the dataframe!

Let’s begin by making a column for month and a column for season!

donuts_ms <- donuts %>%
  mutate(Year = ordered(as.factor(paste(year(Date)))),
         Month = ordered(as.factor(paste(month(Date), year(Date), sep = "/"))),
         Season = as.factor(ifelse(Month %in% c("8/2013", "9/2013", "10/2013", "8/2014", "9/2014", "10/2014", "8/2015","9/2015", "10/2015", "8/2016", "9/2016", "10/2016", "8/2017", "9/2017", "10/2017", "8/2018", "9/2018", "10/2018"), "Fall",
                  ifelse(Month %in% c("11/2013", "12/2013", "1/2013", "2/2013","11/2014", "12/2014", "1/2014", "2/2014", "11/2015", "12/2015", "1/2015", "2/2015", "11/2016", "12/2016", "1/2016", "2/2016", "11/2017", "12/2017", "1/2017", "2/2017", "11/2018", "12/2018", "1/2018", "2/2018"), "Winter",
                  ifelse(Month %in% c("3/2013", "4/2013", "3/2014", "4/2014 ", "3/2015", "4/2015", "3/2016", "4/2016", "3/2017", "4/2017", "3/2018", "4/2018"), "Spring", "Summer")))))

You may have noticed that I only coded two months as “Spring”–this will allow me to capture the majority of summer break in the “Summer” season and accurately determine if the corresponding student absence truly has an effect on our donut sales. Let’s do some quick data visualization to see if this is the case!

Wow! Much to my surprise, donut sales in the summer don’t appear to lag at all, one could even make a tentative claim that they sell slightly better than in the spring. I suppose that the tourist influx mitigates any dip in business from student absence.

donuts_ms %>% 
  ggplot(aes(Date, Total, group = Season)) +
  geom_boxplot() +
  facet_wrap(~ Season)

There doesn’t appear to be any significant difference in donut sales based on season.

Let’s check the number of donuts sold by season.

# A tibble: 4 x 2
  Season  Total
  <fctr>  <int>
1   Fall 135640
2 Spring  68033
3 Summer 123027
4 Winter 131385

Spring appears to be a slump season, but I readily attribute that to the way I coded seasons.

Discussion

I set out to build a practical linear model to predict the number of yeast donuts the shop would need to make on a given day based on a handful of external factors. However, the “data visualization and investigation” section revealed no effect of season on donut sales. This means that summer break at the university, and the subsequent absence of the student population (which accounts for roughly half of Boone’s overall population), has no effect on Local Lion’s donut sales. Proving the contrary was fundamental to my proposed model, so I’m going to have to table the model-building for a bit. However, from this discovery I was able to infer that Boone’s tourism industry is significant enough to compensate for nearly half of the population. During my 7 years in Boone, I had no idea that the tourist economy was that significant! A conversation could be had regarding the students’ actual buying power, but for now it simply spells the end of my model-building exercise.
Now I’m going to continue to dig in the data to answer other questions I’ve had.

Have we sold more donuts or coffee?

s13 <- read.csv("./s13.csv")
s14 <- read.csv("./s14.csv")
s15 <- read.csv("./s15.csv")
s16 <- read.csv("./s16.csv")
s17 <- read.csv("./s17.csv")
s18 <- read.csv("./s18.csv")
glimpse(s18)
Observations: 81
Variables: 11
$ Item.Name         <fctr> Custom Amount, 16.9 Oz San Pellegrino, 2 Do...
$ Item.Variation    <fctr> No description, Regular Price, Regular Pric...
$ SKU               <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
$ Category          <fctr> Uncategorized, Uncategorized, Uncategorized...
$ Items.Sold        <int> 1125, 239, 1939, 9232, 6822, 9, 6424, 44, 19...
$ Gross.Sales       <fctr> $18,732.53, $717.00, $484.75, $5,169.92, $3...
$ Items.Refunded    <int> -5, -1, -3, -6, -8, 0, -6, 0, 0, 0, 0, 0, 0,...
$ Refunds           <fctr> $-437.99, $-3.00, $-0.75, $-3.36, $-4.48, $...
$ Discounts...Comps <fctr> $-35.40, $-9.00, $0.00, $-12.22, $-3.74, $-...
$ Net.Sales         <fctr> $18,259.14, $705.00, $484.00, $5,154.34, $3...
$ Tax               <fctr> $273.46, $46.90, $34.38, $362.64, $271.17, ...
s18 <- s18 %>% 
  select(Item.Name, Items.Sold, Net.Sales)
s18$Item.Name <- as.character(s18$Item.Name)
s18$Net.Sales <- as.numeric(gsub('[$,]', '', s18$Net.Sales))
glimpse(s18)
Observations: 81
Variables: 3
$ Item.Name  <chr> "Custom Amount", "16.9 Oz San Pellegrino", "2 Dough...
$ Items.Sold <int> 1125, 239, 1939, 9232, 6822, 9, 6424, 44, 19, 316, ...
$ Net.Sales  <dbl> 18259.14, 705.00, 484.00, 5154.34, 3812.10, 34.80, ...
s18 <- s18[-c(1,2,4:6,8:11,33,37,39,47,51,52,60,65,66,68,70,71,73:76,79),]
s18$Item.Name
 [1] "2 Doughnut Holes"       "Americano (double)"    
 [3] "Breve (12oz)"           "Breve (16oz)"          
 [5] "Breve (20oz)"           "Cafe Au Lait (12oz)"   
 [7] "Cafe Au Lait (16oz)"    "Cafe Au Lait (20oz)"   
 [9] "Cappuccino Traditional" "Chai Latte (Large)"    
[11] "Chai Latte (Medium)"    "Chai Latte (Small)"    
[13] "Coffee lb"              "Cold Brew 16oz"        
[15] "Cold Brewwww"           "Cortado"               
[17] "Cubano"                 "D- (Dozen)"            
[19] "D- Cake"                "D- Cake Dozen"         
[21] "D- Filled Specialty"    "D- Holes Cup"          
[23] "Doughnut Yeast"         "Drip (12oz.)"          
[25] "Drip (16oz.)"           "Drip (20oz)"           
[27] "Espresso Double"        "Frappe (16oz.)"        
[29] "Fresh Red"              "Group Coffee "         
[31] "Half Dozen (yeast)"     "Hazel Truffle (12oz)"  
[33] "Hazel Truffle (16oz)"   "Hazel Truffle (20oz)"  
[35] "Hot Chocolate (12oz)"   "Hot Chocolate (16oz)"  
[37] "Hot Chocolate (20oz)"   "Latte 12oz"            
[39] "Latte 16oz"             "Latte 20oz"            
[41] "London Fog (12oz)"      "London Fog (16oz)"     
[43] "London Fog (20oz)"      "Macchiato"             
[45] "Mo Qal Beeb LV (12oz)"  "Mo Qal Beeb LV (16oz)" 
[47] "Mo Qal Beeb LV (20oz)"  "Muffin"                
[49] "Nitro 16oz"             "Pour Over"             
[51] "Special One"            "Lotus/Biscuit"         
[53] "Muffin"                 "Smoothie (16oz.)"      
[55] "Truffle"               
s18$Desserts <- as.integer(c(1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,1,0,1))
glimpse(s18)
Observations: 55
Variables: 4
$ Item.Name  <chr> "2 Doughnut Holes", "Americano (double)", "Breve (1...
$ Items.Sold <int> 1939, 6424, 271, 225, 140, 118, 186, 121, 1962, 918...
$ Net.Sales  <dbl> 484.00, 13525.33, 1013.99, 894.30, 590.80, 276.12, ...
$ Desserts   <int> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
s18 %>% 
filter(Desserts == 1) %>% 
summarize("Total Sold" = sum(Items.Sold),
          "Net Sales" = sum(Net.Sales))
  Total Sold Net Sales
1      70472    154338
s18 %>% 
filter(Desserts == 0) %>% 
summarize("Total Sold" = sum(Items.Sold),
          "Net Sales" = sum(Net.Sales))
  Total Sold Net Sales
1     100715  329468.3