The following code walks through a simple linear regression fit using R tidyverse methods and the Lego Sets dataset from kaggle.com. This dataset holds features for lego sets available through lego.com, including the number of pieces in a set, the listed sale price, the suggested age range, and other information, all available as a downloadable .csv file. We will explore this dataset, but first some housekeeping to set up our R environment.
Import the libraries we will use:
library( tidyverse ) #tidyverse to model our data & other functions
library( tidymodels )
library( rsample ) #facilitates splitting data in train/test
library( ggplot2 ) #basic plotting of our lin. model fit + data
library( ggfortify ) #plot diagnostics of our lin. model
Great! Now to load lego_sets.csv as an R dataframe. To make our lives easier, the data has been uploaded to the author’s github account and will be accessed via a url link to the raw data:
lego_url <- 'https://raw.githubusercontent.com/SmilodonCub/DATA605/master/lego_sets.csv'
lego_df <- read.csv( lego_url ) #read data in as an r dataframe
dim( lego_df ) #print the dimensions of the dataset
## [1] 12261 14
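We can also print the column names; names() is assumed here as the call that produced the output below:
names( lego_df ) #print the column names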
## [1] "ages" "list_price" "num_reviews"
## [4] "piece_count" "play_star_rating" "prod_desc"
## [7] "prod_id" "prod_long_desc" "review_difficulty"
## [10] "set_name" "star_rating" "theme_name"
## [13] "val_star_rating" "country"
From the output above, we can see that there are 12,261 rows (entries) of data. Each row represents a lego set available on lego.com. There are 14 columns of data, and their names have been printed above. From this we can begin to guess what the data might look like. For example, list_price is probably a column of floats giving the suggested sale price for a given lego set. We could use the head() function to view the first several rows of the dataframe. However, instead we will use the glimpse() function. glimpse() returns a transposed version of the printed dataframe such that each row of the result corresponds to a column and lists as many entries as will fit, along with the data type. This output is more concise and gives a more intuitive feel for the data than printing the frame or using head() would provide.
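glimpse( lego_df ) #concise, transposed preview of every column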
## Rows: 12,261
## Columns: 14
## $ ages <chr> "6-12", "6-12", "6-12", "12+", "12+", "12+", "12+",…
## $ list_price <dbl> 29.99, 19.99, 12.99, 99.99, 79.99, 59.99, 59.99, 49…
## $ num_reviews <dbl> 2, 2, 11, 23, 14, 7, 37, 24, 23, 11, 14, 53, 7, 63,…
## $ piece_count <dbl> 277, 168, 74, 1032, 744, 597, 598, 780, 468, 444, 3…
## $ play_star_rating <dbl> 4.0, 4.0, 4.3, 3.6, 3.2, 3.7, 3.7, 4.4, 3.6, 3.6, 4…
## $ prod_desc <chr> "Catapult into action and take back the eggs from t…
## $ prod_id <dbl> 75823, 75822, 75821, 21030, 21035, 21039, 21028, 21…
## $ prod_long_desc <chr> "Use the staircase catapult to launch Red into the …
## $ review_difficulty <chr> "Average", "Easy", "Easy", "Average", "Challenging"…
## $ set_name <chr> "Bird Island Egg Heist", "Piggy Plane Attack", "Pig…
## $ star_rating <dbl> 4.5, 5.0, 4.3, 4.6, 4.6, 4.9, 4.2, 4.7, 4.7, 4.8, 4…
## $ theme_name <chr> "Angry Birds™", "Angry Birds™", "Angry Birds™", "Ar…
## $ val_star_rating <dbl> 4.0, 4.0, 4.1, 4.3, 4.1, 4.4, 4.1, 4.3, 4.1, 4.5, 3…
## $ country <chr> "US", "US", "US", "US", "US", "US", "US", "US", "US…
The glimpse() output is quite informative. We can get a good idea of which features are categorical (e.g. country, ages, review_difficulty) and which are numeric (e.g. list_price, piece_count, num_reviews). To get more information about each feature, we can also use the summary() method:
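summary( lego_df ) #summary statistics for each column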
## ages list_price num_reviews piece_count
## Length:12261 Min. : 2.272 Min. : 1.00 Min. : 1.0
## Class :character 1st Qu.: 19.990 1st Qu.: 2.00 1st Qu.: 97.0
## Mode :character Median : 36.588 Median : 6.00 Median : 216.0
## Mean : 65.142 Mean : 16.83 Mean : 493.4
## 3rd Qu.: 70.192 3rd Qu.: 13.00 3rd Qu.: 544.0
## Max. :1104.870 Max. :367.00 Max. :7541.0
## NA's :1620
## play_star_rating prod_desc prod_id prod_long_desc
## Min. :1.000 Length:12261 Min. : 630 Length:12261
## 1st Qu.:4.000 Class :character 1st Qu.: 21034 Class :character
## Median :4.500 Mode :character Median : 42069 Mode :character
## Mean :4.338 Mean : 59837
## 3rd Qu.:4.800 3rd Qu.: 70922
## Max. :5.000 Max. :2000431
## NA's :1775
## review_difficulty set_name star_rating theme_name
## Length:12261 Length:12261 Min. :1.800 Length:12261
## Class :character Class :character 1st Qu.:4.300 Class :character
## Mode :character Mode :character Median :4.700 Mode :character
## Mean :4.514
## 3rd Qu.:5.000
## Max. :5.000
## NA's :1620
## val_star_rating country
## Min. :1.000 Length:12261
## 1st Qu.:4.000 Class :character
## Median :4.300 Mode :character
## Mean :4.229
## 3rd Qu.:4.700
## Max. :5.000
## NA's :1795
The output from summary() gives us a better understanding of the spread of the numerical features. It also reveals missing values: the review and rating columns each contain well over a thousand NA’s.
We can now use what we have learned about the structure of the data to ask questions of our dataset. For example, how many Star Wars lego sets are there?
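The original filtering call is not shown; a dplyr sketch along these lines would produce the output below (assuming the theme label contains 'Star Wars', possibly with a trademark symbol as seen in theme_name above):
starwars_df <- lego_df %>% filter( str_detect( theme_name, 'Star Wars' ) )
dim( starwars_df ) #dimensions of the Star Wars subset
summary( starwars_df['list_price'] ) #price spread for Star Wars sets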
## [1] 1377 14
## list_price
## Min. : 9.99
## 1st Qu.: 27.99
## Median : 42.69
## Mean : 91.50
## 3rd Qu.: 102.00
## Max. :1104.87
OK, so there are 1377 Star Wars lego sets on the market. The average (mean) price is $91.50, but the most expensive is $1104.87! Interesting, but not what we came here for.
We would like to explore the data to see if there is a relationship between the list_price of a lego set and the piece_count. In other words, does the price increase as a function of the number of pieces in a lego set? Furthermore, how well can this relationship be described by a linear model?
Here, we use the lm() function to fit a linear model to a training subset of the lego data. Specifically, we would like to return an “lm” object that describes how list_price is modeled by piece_count (list_price ~ piece_count):
#split data into train and test sets.
set.seed( 138 )
lego_split <- initial_split( lego_df, prop = 3/4 )
train_lego <- training( lego_split )
test_lego <- testing( lego_split )
#fit a linear regression model to the training subset of the data
linregmod <- lm( list_price ~ piece_count, data = train_lego )
summary( linregmod ) #return a summary of the linear model
##
## Call:
## lm(formula = list_price ~ piece_count, data = train_lego)
##
## Residuals:
## Min 1Q Median 3Q Max
## -272.15 -14.49 -6.55 6.61 648.70
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.716e+01 5.627e-01 30.5 <2e-16 ***
## piece_count 9.773e-02 5.981e-04 163.4 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 46.23 on 9194 degrees of freedom
## Multiple R-squared: 0.7438, Adjusted R-squared: 0.7438
## F-statistic: 2.67e+04 on 1 and 9194 DF, p-value: < 2.2e-16
The summary() output gives us a lot of information about the resulting linear model. For instance, the coefficients tell us that the best-fit linear model is given by the equation: \[list\_price = 17.16 + 0.0977 \cdot piece\_count\] This tells us that there is a positive relationship between list_price and piece_count: for every additional piece, the price is modeled to increase by approximately 10 cents. The y-intercept, however, has no real-world meaning (there would never be a lego set with 0 pieces on sale for $17.16). Additionally, from the \(R^2\) value, we see that the linear model explains about 74% of the training data’s variance, and the very small p-value indicates that the relationship is statistically significant.
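As a quick sanity check of this equation, we can ask the model for the predicted price of a hypothetical 500-piece set (an illustrative value, not from the original analysis):
predict( linregmod, data.frame( piece_count = 500 ) ) #17.16 + 0.0977*500, roughly $66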
Now to visualize the training data set and the fit of the linear model:
modpredict <- linregmod %>% predict( train_lego )
ggplot( train_lego, aes( y= list_price, x = piece_count ) ) +
geom_point() +
geom_line( aes( y = modpredict )) +
ggtitle( 'Visualization of Training Data & Linear Model fit' )
At a glance, the linear model appears to fit the training data points rather well. However, next we will evaluate the model’s ability to describe the test data that was withheld from the fitting process.
modpredict <- linregmod %>% predict( test_lego )
test_rsq <- rsq_vec( test_lego$list_price, modpredict )
rsq_lbl <- paste( 'R^2 =', round( test_rsq,3 ) )
ggplot( test_lego, aes( y= list_price, x = piece_count ) ) +
geom_point() +
geom_line( aes( y = modpredict )) +
ggtitle( 'Visualization of Test Data & Linear Model fit' ) +
geom_text( x = 200, y = 900, label = rsq_lbl )
The figure above shows the linear model’s fit to the test data and is annotated with \(R^2 = 0.792\), which suggests that the model captures much of the variance in the test data. In fact, the model performs slightly better on the test set than on the training set.
The next step is to use some diagnostics to evaluate whether a linear model is appropriate for this data. For this, the autoplot() method from the ggfortify library will be used:
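autoplot( linregmod ) #standard lm diagnostics: residuals vs. fitted, normal Q-Q, scale-location, residuals vs. leverage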
The diagnostic plots above are different visualizations that tell us about the appropriateness of a linear model for our dataset: the residuals-vs-fitted and scale-location plots check for non-linearity and non-constant variance, the normal Q-Q plot checks whether the residuals are approximately normal, and the residuals-vs-leverage plot flags influential outliers.
In conclusion, a linear model describes quite a bit of the variance in the data. However, the diagnostics suggest that there are outliers that throw off the fit. The next step might be to clean some of these values out of the data and re-evaluate, as sketched below.
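One way to sketch that cleanup (the 4/n Cook’s distance cutoff is a common rule of thumb, not part of the original analysis) is to drop high-influence training points and refit:
cooks <- cooks.distance( linregmod ) #influence of each training observation
train_clean <- train_lego[ cooks < 4 / nrow( train_lego ), ] #drop high-influence rows
linregmod2 <- lm( list_price ~ piece_count, data = train_clean )
summary( linregmod2 )$r.squared #compare against the original fit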