Simple Linear Regression: Lego Data Set

Bonnie Cooper


The Lego Data Set

    The following code will walk through a simple linear regression fit using R tidyverse methods and the Lego Sets dataset from kaggle.com. This dataset holds features for Lego sets available through lego.com, including the number of pieces in a set, the listed sale price, the suggested age range, and other information, and comes as a downloadable .csv file. We will explore this dataset, but first, some housekeeping to set up our R environment.
    Import the libraries we will utilize:
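    A minimal setup chunk, assuming the libraries implied by the functions used below (tidyverse for wrangling and plotting, ggfortify for the diagnostic plots at the end):

library(tidyverse)  # dplyr, ggplot2 and friends for data wrangling and plotting
library(ggfortify)  # provides autoplot() diagnostics for lm objects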

    Great! Now to load lego_sets.csv as an R dataframe. To make our lives easier, the data has been uploaded to the author’s GitHub account and will be accessed via a URL link to the raw data:
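    A sketch of the import; the URL below is a placeholder since the exact repository path is not shown, and the dataframe name lego_df is assumed throughout:

# hypothetical raw-data URL; substitute the actual link from the author's repo
lego_url <- "https://raw.githubusercontent.com/<user>/<repo>/main/lego_sets.csv"
lego_df  <- read.csv(lego_url, stringsAsFactors = FALSE)

dim(lego_df)    # number of rows and columns
names(lego_df)  # the column names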

## [1] 12261    14
##  [1] "ages"              "list_price"        "num_reviews"      
##  [4] "piece_count"       "play_star_rating"  "prod_desc"        
##  [7] "prod_id"           "prod_long_desc"    "review_difficulty"
## [10] "set_name"          "star_rating"       "theme_name"       
## [13] "val_star_rating"   "country"

    From the output above, we can see that there are 12,261 rows (entries) of data. Each row represents a Lego set available on lego.com. There are 14 columns of data, and their names have been printed above. From this we can begin to guess what the data might look like. For example, list_price is probably a column of floats giving the suggested sale price for a given Lego set. We could use the head() function to view the first several entries of the dataframe. Instead, however, we will use the glimpse() function. glimpse() returns a transposed version of the printed dataframe such that each row of the result corresponds to a column and lists as many entries as will fit, along with the data type. This output is more concise and gives a more intuitive feel for the data than printing the frame or using head() would provide.
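    The call itself is a one-liner (using the lego_df name assumed above):

glimpse(lego_df)  # one row per column: name, type, and the first few values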

## Rows: 12,261
## Columns: 14
## $ ages              <chr> "6-12", "6-12", "6-12", "12+", "12+", "12+", "12+",…
## $ list_price        <dbl> 29.99, 19.99, 12.99, 99.99, 79.99, 59.99, 59.99, 49…
## $ num_reviews       <dbl> 2, 2, 11, 23, 14, 7, 37, 24, 23, 11, 14, 53, 7, 63,…
## $ piece_count       <dbl> 277, 168, 74, 1032, 744, 597, 598, 780, 468, 444, 3…
## $ play_star_rating  <dbl> 4.0, 4.0, 4.3, 3.6, 3.2, 3.7, 3.7, 4.4, 3.6, 3.6, 4…
## $ prod_desc         <chr> "Catapult into action and take back the eggs from t…
## $ prod_id           <dbl> 75823, 75822, 75821, 21030, 21035, 21039, 21028, 21…
## $ prod_long_desc    <chr> "Use the staircase catapult to launch Red into the …
## $ review_difficulty <chr> "Average", "Easy", "Easy", "Average", "Challenging"…
## $ set_name          <chr> "Bird Island Egg Heist", "Piggy Plane Attack", "Pig…
## $ star_rating       <dbl> 4.5, 5.0, 4.3, 4.6, 4.6, 4.9, 4.2, 4.7, 4.7, 4.8, 4…
## $ theme_name        <chr> "Angry Birds™", "Angry Birds™", "Angry Birds™", "Ar…
## $ val_star_rating   <dbl> 4.0, 4.0, 4.1, 4.3, 4.1, 4.4, 4.1, 4.3, 4.1, 4.5, 3…
## $ country           <chr> "US", "US", "US", "US", "US", "US", "US", "US", "US…

    The glimpse() output is quite informative. We can get a good idea of which features are categorical (e.g. country, ages, review_difficulty) and which are numeric variables (e.g. list_price, piece_count, num_reviews, etc.). To get more information about each feature, we can also use the summary() function:
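summary(lego_df)  # quartiles and means for numeric columns; class info for character columns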

##      ages             list_price        num_reviews      piece_count    
##  Length:12261       Min.   :   2.272   Min.   :  1.00   Min.   :   1.0  
##  Class :character   1st Qu.:  19.990   1st Qu.:  2.00   1st Qu.:  97.0  
##  Mode  :character   Median :  36.588   Median :  6.00   Median : 216.0  
##                     Mean   :  65.142   Mean   : 16.83   Mean   : 493.4  
##                     3rd Qu.:  70.192   3rd Qu.: 13.00   3rd Qu.: 544.0  
##                     Max.   :1104.870   Max.   :367.00   Max.   :7541.0  
##                                        NA's   :1620                     
##  play_star_rating  prod_desc            prod_id        prod_long_desc    
##  Min.   :1.000    Length:12261       Min.   :    630   Length:12261      
##  1st Qu.:4.000    Class :character   1st Qu.:  21034   Class :character  
##  Median :4.500    Mode  :character   Median :  42069   Mode  :character  
##  Mean   :4.338                       Mean   :  59837                     
##  3rd Qu.:4.800                       3rd Qu.:  70922                     
##  Max.   :5.000                       Max.   :2000431                     
##  NA's   :1775                                                            
##  review_difficulty    set_name          star_rating     theme_name       
##  Length:12261       Length:12261       Min.   :1.800   Length:12261      
##  Class :character   Class :character   1st Qu.:4.300   Class :character  
##  Mode  :character   Mode  :character   Median :4.700   Mode  :character  
##                                        Mean   :4.514                     
##                                        3rd Qu.:5.000                     
##                                        Max.   :5.000                     
##                                        NA's   :1620                      
##  val_star_rating   country         
##  Min.   :1.000   Length:12261      
##  1st Qu.:4.000   Class :character  
##  Median :4.300   Mode  :character  
##  Mean   :4.229                     
##  3rd Qu.:4.700                     
##  Max.   :5.000                     
##  NA's   :1795

    The output from summary() gives us a better understanding of the spread of numerical features.

    We can now use what we have learned about the structure of the data to ask questions of our dataset. For example, how many Star Wars Lego sets are there?
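    A tidyverse sketch of that query; the exact theme string is an assumption, written here with the trademark symbol that the theme names in the glimpse() output above carry:

starwars_df <- lego_df %>%
  filter(theme_name == "Star Wars™")

dim(starwars_df)                                   # how many sets?
starwars_df %>% select(list_price) %>% summary()   # price distribution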

## [1] 1377   14
##    list_price     
##  Min.   :   9.99  
##  1st Qu.:  27.99  
##  Median :  42.69  
##  Mean   :  91.50  
##  3rd Qu.: 102.00  
##  Max.   :1104.87

    OK, so there are 1,377 Star Wars Lego sets on the market. The average (mean) price is $91.50, but the most expensive is $1,104.87! Interesting, but not what we came here for.

Fitting a Simple Linear Regression

    We would like to explore the data to see if there is a relationship between the list_price of a lego set and the piece_count. In other words, does the price increase as a function of the number of pieces in a lego set? Furthermore, how well can this relationship be described by a linear model?

    Here, we use the lm() function to fit a linear model to a training subset of the Lego data. Specifically, we would like to return an “lm” object that describes how list_price is modeled by piece_count (list_price ~ piece_count):
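    A sketch of the split and fit. The 75/25 train/test split is inferred from the 9,194 residual degrees of freedom reported below (9,196 training rows out of 12,261), and the seed is arbitrary:

set.seed(42)  # arbitrary seed for a reproducible split
train_idx  <- sample(nrow(lego_df), size = round(0.75 * nrow(lego_df)))
train_lego <- lego_df[train_idx, ]
test_lego  <- lego_df[-train_idx, ]

lego_lm <- lm(list_price ~ piece_count, data = train_lego)
summary(lego_lm)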

## 
## Call:
## lm(formula = list_price ~ piece_count, data = train_lego)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -272.15  -14.49   -6.55    6.61  648.70 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1.716e+01  5.627e-01    30.5   <2e-16 ***
## piece_count 9.773e-02  5.981e-04   163.4   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 46.23 on 9194 degrees of freedom
## Multiple R-squared:  0.7438, Adjusted R-squared:  0.7438 
## F-statistic: 2.67e+04 on 1 and 9194 DF,  p-value: < 2.2e-16

    The summary() output gives us a lot of information about the resulting linear model. For instance, the coefficients tell us that the linear model that best describes the data is given by the equation: \[list\_price = 17.16 + 0.0977 \times piece\_count\] This tells us that there is a positive relationship between list_price and piece_count and that for every additional piece, the price is modeled to increase by approximately 10 cents. The y-intercept, however, has no real-world meaning (there would never be a Lego set with 0 pieces on sale for $17.16). Additionally, from the \(R^2\) value, we see that the linear model explains about 74% of the training data’s variance, and the very small p-values indicate that the relationship between price and piece count is statistically significant.

    Now to visualize the training data set and the fit of the linear model:
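    A possible ggplot2 chunk; geom_smooth(method = "lm") refits the same line on the training data:

ggplot(train_lego, aes(x = piece_count, y = list_price)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  labs(title = "Training data with fitted linear model",
       x = "piece_count", y = "list_price")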

    At a glance, the linear model appears to fit the training data points rather well. However, next we will evaluate the model’s ability to describe the test data that was withheld from the fitting process.
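    One way to produce that evaluation: predict on the held-out rows, compute \(R^2\) as the squared correlation between predicted and observed prices, and annotate the plot. The annotation coordinates below are arbitrary:

test_pred <- predict(lego_lm, newdata = test_lego)
r2_test   <- cor(test_pred, test_lego$list_price)^2

ggplot(test_lego, aes(x = piece_count, y = list_price)) +
  geom_point(alpha = 0.3) +
  geom_abline(intercept = coef(lego_lm)[1],   # training-set model, not a refit
              slope = coef(lego_lm)[2], color = "red") +
  annotate("text", x = 5000, y = 900,
           label = paste0("R^2 = ", round(r2_test, 3)))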

    The figure above shows the linear model’s fit to the test data and is annotated to show \(R^2 = 0.792\), which suggests that the model captures much of the variance in the test data. The model actually does slightly better on the test set than on the training set.

    The next step is to use some diagnostics to evaluate whether a linear model is appropriate for this data. For this, the autoplot() method from the ggfortify library will be used:
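autoplot(lego_lm)  # Residuals vs Fitted, Normal Q-Q, Scale-Location, Residuals vs Leverage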

    The diagnostic plots above are different visualizations that tell us about the appropriateness of a linear model for our dataset.

  1. Residuals vs Fitted: this plots the residual values against the fitted values. Ideally, the data points should plot symmetrically about the Residuals = 0 line. However, we see a pattern where the residuals deviate as a function of the fitted values.
  2. Normal Q-Q: ideally, the values should fall along the dashed line. However, there are large deviations at the very small and very large tails of the data, suggesting outliers.
  3. Scale-Location: this tells us whether the residuals spread equally across the range of fitted values. That appears to be roughly the case here.
  4. Residuals vs Leverage: this helps detect influential outliers. The tilt in the graph suggests that there are outliers exerting a large influence on the fit.

    In conclusion, a linear model describes quite a bit of the variance in the data. However, the diagnostics suggest that there are outliers that throw off the fit. The next step might be to clean some of these values out of the data and re-evaluate.
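    As an illustration of that next step (the cutoff below is an arbitrary choice, not the author’s), one could drop points with large standardized residuals and refit:

# drop observations with |standardized residual| > 3 (arbitrary cutoff), then refit
keep     <- abs(rstandard(lego_lm)) <= 3
lego_lm2 <- lm(list_price ~ piece_count, data = train_lego[keep, ])
summary(lego_lm2)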